Raspberry Pi Hadoop Cluster

If you would like to learn more about distributed computing and Big Data processing by building your own Raspberry Pi Hadoop cluster, this tutorial is just what you need.

This tutorial assumes no previous knowledge of Hadoop. Hadoop is a framework for the storage and processing of large amounts of data. Running Hadoop on a Raspberry Pi is not the most efficient approach, but for learning and experimenting, Raspberry Pis are a perfect fit.

We will start off by using one Raspberry Pi as a single node in our Hadoop cluster, which is very useful for development work. Once the first node is working, we will add three more Pis to resemble a practical deployment.

Linux is the official development and production platform for Hadoop, although Windows is a supported development platform as well. On a Windows machine you will need to install Cygwin (http://www.cygwin.com/) to enable shell and Unix scripts.

Running Hadoop requires Java (version 1.6 or higher). You can download the latest JDK for your operating system from the Oracle web page.

 

Table of Contents

 

Fundamentals of Hadoop

 

What is Hadoop?

"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

@ hadoop.apache.org

 

Components of Hadoop

A fully configured cluster runs a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include:

  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker
 

Daemons/services

Daemons/services Description
NameNode Runs on the master node. Manages the HDFS file system on the cluster.
Secondary NameNode Performs housekeeping for the NameNode: it periodically merges the HDFS edit log into the filesystem image (fsimage) to keep the NameNode's metadata compact. It is not a hot standby for the NameNode.
JobTracker Manages MapReduce jobs and distributes them to the nodes in the cluster.
DataNode Runs on slave nodes. Acts as HDFS file storage.
TaskTracker Runs MapReduce jobs which are received from the JobTracker.
 

Start/Stop Scripts

These scripts are executed from the NameNode. They trigger SSH connections that start the daemons on all configured nodes in the cluster (the nodes defined in /opt/hadoop/etc/hadoop/slaves).

Script Description
start-dfs.sh Starts NameNode, Secondary NameNode and DataNode(s)
stop-dfs.sh Stops NameNode, Secondary NameNode and DataNode(s)
start-yarn.sh Starts the daemons that run MapReduce jobs (the ResourceManager and NodeManager(s); in Hadoop 2.x, MapReduce runs on YARN)
stop-yarn.sh Stops the ResourceManager and NodeManager(s)
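Once a node is configured, a quick way to see which daemons are actually running is the jps tool that ships with the JDK. As a sketch, on a single-node setup a session might look like this (the process IDs will differ on your machine):

```
$ start-dfs.sh
$ jps
2169 NameNode
2271 SecondaryNameNode
2362 DataNode
2425 Jps
```

If one of the expected daemons is missing from the jps output, check its log file under the Hadoop logs directory.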
 

Web Interface (default ports)

Status and information of Hadoop daemons can be viewed from a web browser.

Daemons/services Port
NameNode 50070
Secondary NameNode 50090
JobTracker 50030
DataNode(s) 50075
TaskTracker(s) 50060
 

The Setup

Name     IP            Hadoop Roles
Master   192.168.50.1  NameNode, Secondary NameNode, JobTracker
Slave_01 192.168.50.11 DataNode, TaskTracker
Slave_02 192.168.50.12 DataNode, TaskTracker
Slave_03 192.168.50.13 DataNode, TaskTracker

NOTE: Make sure that you configure the names and IP addresses to fit your environment.
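The hostnames in the table need to resolve to the right addresses on every node. A simple way to achieve this, assuming you are not running a local DNS server, is to append the mappings to /etc/hosts on each Pi (the lowercase names here are illustrative; use whatever hostnames you assign):

```
# Append to /etc/hosts on every node
192.168.50.1    master
192.168.50.11   slave_01
192.168.50.12   slave_02
192.168.50.13   slave_03
```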

 

Single Node Setup

 

Install Raspbian

Install Raspbian on each SD card. If you have not done this before, follow the standard Raspbian installation guide from the Raspberry Pi Foundation.

 

Configure Network

Install a text editor of your choice and edit the file as the root user (or prepend sudo to the command). I like to use the nano editor.


sudo nano /etc/network/interfaces
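As a sketch, a static address entry for the master node could look like this (the interface name eth0 and the gateway address are assumptions; adjust both to your network):

```
# /etc/network/interfaces (excerpt): static IP for the master node
auto eth0
iface eth0 inet static
    address 192.168.50.1
    netmask 255.255.255.0
    gateway 192.168.50.254
```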
 

Hardware

List of parts:

  • 4x Raspberry Pi 2 Model B
  • 1x Gigabit Switch (8-port D-Link DGS-108)
  • 4x Micro SD cards
  • 4x USB cables
  • Ethernet cables, power supplies, and so on
 

Java Configuration

Newer Raspbian images (such as 2014-09-09-wheezy-raspbian.img) come with a Java environment preinstalled. To verify that it is installed, simply type in a terminal:

$ java -version
 

If a Java environment is installed in your distribution of Linux, the terminal response should look similar to this:

java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) Client VM (build 25.65-b01, mixed mode)
 

Prepare Hadoop User Account and Group

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo
 

Configure SSH

 

Create an SSH RSA key pair with a blank password so that the Hadoop nodes can talk to each other without prompting for a password.

$ su hduser
$ mkdir ~/.ssh
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
 

Verify that hduser can log in via SSH:

$ su hduser
$ ssh localhost
 

Exit back to the previous shell (pi/root).

 

Install Hadoop

 

Run this command to return to the user's home folder:

$ cd ~/

Download

 

I would suggest installing the newest version of Hadoop, but keep in mind that in newer versions the configuration files or other settings could be placed differently. If you run into problems, consult the documentation on the Apache Hadoop web page.

 
$ wget http://apache.mirrors.spacedump.net/hadoop/core/hadoop-2.7.2/hadoop-2.7.2.tar.gz

Install

 
$ sudo mkdir -p /opt
$ sudo tar -xvzf hadoop-2.7.2.tar.gz -C /opt/
$ cd /opt/
$ sudo mv hadoop-2.7.2 hadoop
$ sudo chown -R hduser:hadoop hadoop
 

Configure Environment Variables

 

I assume that you are using a version of Raspbian with Java already preinstalled.

 

Add Hadoop to the environment variables by adding these lines to the end of the /etc/bash.bashrc file with your favorite text editor.

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
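The JAVA_HOME line above derives the JDK directory from the java binary's real path: readlink -f resolves the symlink chain, and the sed expression strips the trailing bin/java. A quick way to see what the substitution does (the path below is a made-up example, not a real install location):

```shell
# Demonstrate the sed substitution used for JAVA_HOME above.
# The path is a made-up example; on your Pi it comes from readlink -f /usr/bin/java.
path="/usr/lib/jvm/jdk-8-oracle-arm/jre/bin/java"
java_home=$(echo "$path" | sed "s:bin/java::")
echo "$java_home"   # prints /usr/lib/jvm/jdk-8-oracle-arm/jre/
```

Note the colon delimiters in the sed expression, used so that the slashes in the path do not need escaping.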

Log out and reopen the shell to make sure that you can access the hadoop executables outside the /opt/hadoop/bin/ folder.

$ exit
$ su hduser
$ hadoop version
 

The response should look similar to this, depending on your version:

hduser@Master:/opt/hadoop/bin $ hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.2.jar