Last evening, I spent some time reading a few materials on Hadoop. I know, it has been a long time since I could do more work on Hadoop. In the past, I have used the HDInsight Preview from Microsoft and HortonWorks 1.1 for Windows. This time I wanted to install and play with the latest and greatest Hadoop in its raw form, directly from the Apache site. The bad news is that there is no installer for Hadoop on Windows, and there was no virtual machine readily available for download. When you have to learn Hadoop, you don't want to spend time learning CentOS first.
Anyway, you need to build it from scratch. I found this site very useful, and it saved me tons of time: http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os
However, it was not flawless (your mileage may vary too), and I had to struggle to bring the pieces together.
My Environment
- A Microsoft Windows Server 2008 R2 virtual machine hosted on Hyper-V
- Visual Studio 2012 and SQL Server 2012, including the Microsoft Visual C++ 2010 Redistributable
What didn’t work for me?
I faced problems creating the native distribution ("Build Hadoop bin distribution for Windows", step g).
- Installing Java in the default path of c:\program files\...
I installed the JDK in the C:\Program Files\Java folder. However, after reading the installation material on the Apache site, I realized that the folder path should not contain any spaces. So I un-installed and re-installed Java in C:\Java, and made sure that JAVA_HOME reflected the same (a quick sanity check is shown below).
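For reference, a quick sanity check from a command prompt could look like this, assuming JAVA_HOME points at something like C:\Java\jdk1.7.0_45 (the exact JDK folder name will differ on your machine):
rem confirm JAVA_HOME has no spaces and actually points at a JDK
echo %JAVA_HOME%
dir "%JAVA_HOME%\lib\tools.jar"
"%JAVA_HOME%\bin\java" -version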
- Struggling with installation of Win7 SDK
The Apache installation guidelines also mention that you must have Visual Studio 2010. I had Visual Studio 2012. After a few tries, I searched and found a thread on StackOverflow.com (I love this site as a developer) saying that you must un-install existing versions of the Microsoft Visual C++ 2010 Redistributable before installing the Windows 7 SDK. After un-installing this redistributable, life was much easier and the Windows 7 SDK (also good for my OS, i.e. Windows Server 2008 R2) installed flawlessly. On a side note, my Google search gave me many possible causes, including registry permissions etc., but nothing worked till I hit the jackpot of un-installing the existing C++ 2010 redistributable (one way to list what is installed is shown below).
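If you want to see which redistributables are present before removing them through Programs and Features, a quick check from an elevated command prompt could look like this; this is my own addition, not from the guide, and the name filter is approximate:
rem list installed Visual C++ 2010 redistributables (typed at the prompt; in a .bat file escape % as %%)
wmic product where "name like 'Microsoft Visual C++ 2010%'" get name, version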
- Missing tools.jar file exception
I kept getting a missing tools.jar file exception during the build. Based on Google searches, I added the following to the pom.xml file in the C:\hdfs\hadoop-dist directory:
<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <version>1.7.0_45</version>
  <scope>system</scope>
  <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
- hadoop-annotations error (missing annotations)
This was a known bug, so I edited the pom.xml file in the C:\hdfs\hadoop-common-project\hadoop-auth directory and added:
<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <scope>test</scope>
</dependency>
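With both pom.xml tweaks in place, the build has to be re-run. For reference, the distribution build command documented for Hadoop 2.2.0 on Windows (and used in the srccodes guide) is roughly the following, run from the Windows SDK command prompt inside the source folder (C:\hdfs here); double-check the exact flags against the guide:
rem build the Windows native distribution from the Windows SDK 7.1 command prompt
cd C:\hdfs
mvn package -Pdist,native-win -DskipTests -Dtar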
What did I do differently?
- Installed JDK 7 and not 6
- Changed order of the steps
I wanted to save time and be more efficient (subjective). While the build of the native distribution hadoop-2.2.0.tar.gz (Build Hadoop bin distribution for Windows, step g) was still running, I went ahead with the "Configure Hadoop" step (a sketch of the configuration files follows below).
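For context, the "Configure Hadoop" step mostly means editing a handful of XML files under c:\hadoop\etc\hadoop. A minimal single-node sketch, based on the srccodes guide and the Hadoop 2.2.0 defaults, could look like this; the localhost port and the data directories are illustrative examples, so use whatever the guide prescribes:
<!-- core-site.xml : default file system for a single-node setup -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml : replication and local storage directories (example paths) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///c:/hadoop/data/dfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///c:/hadoop/data/dfs/datanode</value>
  </property>
</configuration>
The mapred-site.xml and yarn-site.xml files need similar small edits (the MapReduce framework name and the YARN shuffle settings), as described in the guide.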
- Created System Variables rather than User Variables
The blog mentions creating User Variables such as JAVA_HOME, M2_HOME and Platform. I created them at the System level rather than the user level (see the commands below). I also had to restart the system, because I noticed that changes to the environment variables were not taking effect without it (still thinking about that).
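For reference, machine-level (System) variables can be created from an elevated command prompt with setx /M. The values below are illustrative examples; adjust the JDK and Maven paths to your own layout (Platform is x64 for a 64-bit build, Win32 otherwise):
rem machine-level (system) environment variables; run from an elevated command prompt
setx JAVA_HOME "C:\Java\jdk1.7.0_45" /M
setx M2_HOME "C:\maven\apache-maven-3.1.1" /M
setx Platform "x64" /M
rem open a new command prompt (or reboot) for the new values to become visible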
- Copied files to bin directory after successful build
After fixing all the issues related to the native distribution build, I ran the start-dfs and start-yarn commands but got exceptions that winutils.exe was missing. The native distribution is built in the C:\hdfs\hadoop-dist\target\hadoop-2.2.0\bin folder, and you need to copy/paste/overwrite all files, including winutils.exe, to the c:\hadoop\bin directory before the HDFS/YARN start commands will work (the equivalent commands are shown below).
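A rough command-line equivalent of that copy step, assuming the same source and target folders as above:
rem overwrite c:\hadoop\bin with the freshly built native binaries (winutils.exe, *.dll, ...)
xcopy /E /Y "C:\hdfs\hadoop-dist\target\hadoop-2.2.0\bin\*" "c:\hadoop\bin\"
rem format the namenode once (if not already done), then start the daemons
c:\hadoop\bin\hdfs namenode -format
c:\hadoop\sbin\start-dfs.cmd
c:\hadoop\sbin\start-yarn.cmd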
At the end of the night (almost 1:00 AM), things finally worked and I could sleep. I would call this a sleepless night in Atlanta. Now that I have a raw, full and latest Hadoop running, I will configure Eclipse and run a MapReduce program in Java. I am also planning to learn Python quickly for Machine Learning (still debating Python vs. R). Ideas?