September 29, 2013

How To Set Up a Multi-Node Hadoop 2.0.x (YARN) Cluster

This post describes the steps required to set up a 2-node Hadoop YARN cluster using the Hadoop 2.0.6-alpha release. It is based on these¹ posts² and can be read as a combination of the two, with the extra steps needed for Hadoop 2.0.6-alpha. The steps discussed here should also work for Hadoop 2.1.0-beta.

User accounts, /etc/hosts file modifications and passwordless SSH

You can follow the steps in this post¹, under the section “User creation and other configurations steps”, to set up the necessary user accounts and passwordless SSH. Make sure passwordless SSH works for localhost as well as for the slaves. The other post² doesn’t mention passwordless SSH for localhost, but it is important: without it, the startup scripts will prompt you for a password when starting the data node and the node manager.

Configuring Hadoop

You can follow steps 4, 5, 6, 7, and 8 in this post² to configure Hadoop, with the small modifications noted below.

During step 5 in this post², you also need to add the JAVA_HOME environment variable to hadoop-env.sh.
One of the most important configuration steps is setting yarn.nodemanager.address and yarn.nodemanager.localizer.address in yarn-site.xml during step 7 of this post². This change is needed only on the master node, because both the resource manager and a node manager run there. Without an address and a localizer address specific to the node manager, the node manager will try to bind to the same ports that the resource manager uses.

With this change, yarn-site.xml on the master node looks like the following.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>master:8050</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.address</name>
    <value>master:8060</value>
  </property>
</configuration>
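
Since the node manager address settings are needed only on the master, a slave node's yarn-site.xml can presumably be the same file without the last two properties. The following is a sketch under that assumption, not a configuration I have tested separately. The host name master and the port numbers are the ones used in the file above.

```xml
<?xml version="1.0"?>
<!-- Sketch of yarn-site.xml for a slave node (assumption: identical to the
     master file, minus the node manager address overrides). -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <!-- The resource manager addresses still point at the master node. -->
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8040</value>
  </property>
  <!-- yarn.nodemanager.address and yarn.nodemanager.localizer.address are
       left at their defaults here: no resource manager runs on a slave,
       so there is no port clash to avoid. -->
</configuration>
```

Keeping the master-only overrides out of the slave file also avoids asking a slave's node manager to bind to the master's host name.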

Running Hadoop YARN and Checking Installation

You can follow steps 9, 10, 11 and 12 in this post² to start and test the installation. The last part of the same post² covers the web interface URLs and how to stop the Hadoop YARN cluster.

This post was moved from my old blog.
