April 18, 2015 – CodeThatAint

View list of Hadoop Files

>>hadoop fs -ls ..

Creating new Folder

>>hadoop fs -mkdir test

The above created file can be viewed in Hue

Adding Files to Hadoop File System

>>hadoop fs -put Test.txt test

Incase files need to be copied from more than one Directory use put command as Below

>>hadoop fs -put Test1 Test2 Test

Getting Files to Hadoop File System

>>hadoop fs -get Test.txt Test1

Deleting a File from Hadoop File System

>>hadoop fs -rm Test1/Test.txt

In the above case the file will be moved to the Trash

Deleting a File from Hadoop File System

>>hadoop fs -rm -skipTrash Test1/Test.txt

Deleting a File- Recursive Remove

>>hadoop fs -rmr -skipTrash Test1

View part of Data file

>> hadoop fs -cat /user/training/shakespeare.txt | tail -n5

Hadoop – Map Reduce

>> hadoop jar Test.jar T1 output

hadoop jar MapReduce.jar InputFile OutputFolder

Start hdfs daemons

>>  start-dfs.sh

Start MapReduce daemons:

>>  start-yarn.sh

Verify Hadoop daemons:

>>  jps

For one JVM (Isolated Process)there will be
Job Tracker – one(Controller and scheduler)
Task Tracker – One per Cluster(Monitors task)

The Map Reduce consist of Two Parts

The Map Part
The Reduce Part

Map Part

Function in java which perform some action in some data.The Map reduce is run as a job.During this run of Map Reduce as a job the Java function gets called in each Node where the data lives.
The Map Reduce runs 3 Nodes (default HDFS cluster is replicated 3 Times).
HDFS is self healing.If one goes down other will be used
Once the MapReduce is run the output will be pairs
The second part is the Reduce Part in the pairs

2 Versions of Map Reduce

Map Reduce Version 1

As given by Google
HDFS Triple Replicated
Parallel Processing via Map and Reduce(aggregated)

Coding Steps

Create a Class
Create a static Map class
Create a static Reduce class
Create a Main Function

Create a Job
Job calls the Map and Reduce Classes

Java Coding for MapReduce

  public class MapReduce{
    public static void Main(String[] args)
    {
      //Create Job Runner Instance
      //Call MapInstance on Job Instance
      //Call ReduceInstance on Job Instance
        
    } 
    
    public void Map()
    {
       //write Mapper
    }

    public void Reduce()
    {
       //write Reducer
    }     
  }

In MapReduce the States should not be Shared
Top Down Programming, One Entry Point – One Exit Point

Aspects of MapReduce

Job – Unit of MapReduce
Map Task runs on each node
Reduce Task – runs on some nodes
Source date – HDFS or other location(amazon s3)

In Java while transferring data over network we serialize and deserialize values for security purposes.In MapReduce the Map output is serialized and the input is deserialized in Reduce.Serialized and Deserialized values are called as Writables in MapReduce. To acheive this String in java is replaced with Text and int in Java is replaced with IntWritable which does the serialization on it own.

Hadoop – Map Reduce

>> hadoop jar MapReduce.jar T1 output

hadoop jar MapReduce.jar InputFile OutputFolder

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

CodeThatAint

Daily Archives: April 18, 2015

Hadoop Terminal Commands

Big Data Map Reduce