View list of Hadoop Files

>>hadoop fs -ls ..

Creating new Folder

>>hadoop fs -mkdir test

The folder created above can be viewed in Hue

Adding Files to Hadoop File System

>>hadoop fs -put Test.txt test

In case files need to be copied from more than one directory, use the put command as below

>>hadoop fs -put Test1 Test2 Test

Getting Files from Hadoop File System

>>hadoop fs -get Test.txt Test1

Deleting a File from Hadoop File System

>>hadoop fs -rm Test1/Test.txt

In the above case the file will be moved to the Trash (when the trash feature is enabled)

Deleting a File Permanently (Skipping the Trash)

>>hadoop fs -rm -skipTrash Test1/Test.txt

Deleting a Folder - Recursive Remove

>>hadoop fs -rm -r -skipTrash Test1

View part of Data file

>> hadoop fs -cat /user/training/shakespeare.txt | tail -n5

Hadoop – Map Reduce

>> hadoop jar Test.jar T1 output

General form: hadoop jar MapReduce.jar InputFile OutputFolder

Start HDFS daemons:

>>  start-dfs.sh

Start YARN daemons (used to run MapReduce jobs):

>>  start-yarn.sh

Verify Hadoop daemons:

>>  jps

Each daemon runs in its own JVM (an isolated process). In MapReduce version 1 there will be
Job Tracker – one per cluster (controller and scheduler)
Task Tracker – one per worker node (runs and monitors tasks)

MapReduce consists of two parts:

The Map Part
The Reduce Part

Map Part

  1. A function in Java that performs some action on some data. MapReduce is run as a job; during the run, the Java function is called on each node where the data lives.
  2. By default HDFS keeps 3 replicas of each block, so the data a map task needs is available on 3 nodes.
  3. HDFS is self-healing: if one node goes down, another replica is used.
  4. Once the map has run, its output is a set of key-value pairs.
  5. The second part, the Reduce part, works on these pairs.
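
The flow from map to pairs to reduce can be sketched in plain Java, without Hadoop. This is only a single-process illustration, not Hadoop's API: the class `MapReduceSketch` and its methods `map`, `reduce`, and `run` are hypothetical names chosen for this example, and word counting is assumed as the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration; real Hadoop calls map() on the nodes holding the data.
class MapReduceSketch {

    // Map part: turn one line of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce part: aggregate all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(List.of("to be or not to be", "to do")));
    }
}
```

Neither `map` nor `reduce` touches anything outside its arguments, which is what would let a framework run many copies of them on different nodes.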

2 Versions of Map Reduce

Map Reduce Version 1

  1. Based on the model described by Google
  2. HDFS data is triple replicated
  3. Parallel processing via Map and Reduce (aggregation)

Coding Steps

  1. Create a Class
  2. Create a static Map class
  3. Create a static Reduce class
  4. Create a Main Function
    1. Create a Job
    2. Job calls the Map and Reduce Classes

Java Coding for MapReduce

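  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MapReduce {
    public static void main(String[] args) throws Exception {
      // Create the job instance
      Job job = Job.getInstance(new Configuration(), "MapReduce");
      job.setJarByClass(MapReduce.class);
      // Register the Map and Reduce classes on the job
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      // args[0] is the input file, args[1] is the output folder
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
      // write Mapper: override map(key, value, context) to emit (Text, IntWritable) pairs
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      // write Reducer: override reduce(key, values, context) to aggregate the values per key
    }
  }

  1. In MapReduce, state should not be shared between tasks
  2. Top-down programming: one entry point, one exit point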

Aspects of MapReduce

  1. Job – the unit of work in MapReduce
  2. Map Task – runs on each node that holds the data
  3. Reduce Task – runs on some nodes
  4. Source data – HDFS or another location (Amazon S3)
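
Because reduce tasks run on only some nodes, every key emitted by the map tasks has to be routed to exactly one reduce task. The sketch below shows one way this can be done with hash partitioning. The formula mirrors the one used by Hadoop's default `HashPartitioner`, but the class `PartitionSketch` and its methods are hypothetical names, and plain Java `String` keys are assumed instead of Hadoop's `Text`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of routing keys to reduce tasks by hash.
class PartitionSketch {

    // Pick the reduce task for a key: clear the sign bit, then take the remainder.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group keys by the reduce task they would be sent to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byTask = new TreeMap<>();
        for (String key : keys) {
            byTask.computeIfAbsent(partitionFor(key, numReduceTasks), t -> new ArrayList<>()).add(key);
        }
        return byTask;
    }

    public static void main(String[] args) {
        // Both occurrences of "to" land in the same group, so one reduce task sees all of them.
        System.out.println(assign(List.of("to", "be", "or", "not", "to"), 2));
    }
}
```

The same key always maps to the same task number, which is what lets a single reduce task aggregate every value for that key.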

In Java, data sent over the network is serialized into bytes by the sender and deserialized back into objects by the receiver. In MapReduce, the Map output is serialized and then deserialized as the input to Reduce. Types that can serialize and deserialize themselves are called Writables in MapReduce. To achieve this, String in Java is replaced with Text and int is replaced with IntWritable, each of which handles its own serialization.
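
To make the serialize and deserialize round trip concrete, here is a minimal plain-Java sketch of what a Writable type does. Hadoop's `Writable` interface declares `write(DataOutput)` and `readFields(DataInput)`; the class `WordCountPair` below is a hypothetical stand-in that follows the same pattern using only `java.io`, so it is an illustration and not Hadoop's own `Text` or `IntWritable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable: a (word, count) pair that can write itself
// to a byte stream and read itself back.
class WordCountPair {
    String word = "";
    int count;

    WordCountPair() { }

    WordCountPair(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Serialize: what happens to map output before it leaves the map side.
    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Deserialize: what happens before the pair is handed to reduce.
    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    // Helper: serialize a pair into a byte array.
    static byte[] toBytes(WordCountPair pair) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper: rebuild a pair from a byte array.
    static WordCountPair fromBytes(byte[] bytes) {
        try {
            WordCountPair pair = new WordCountPair();
            pair.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return pair;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountPair copy = fromBytes(toBytes(new WordCountPair("hadoop", 3)));
        System.out.println(copy.word + " " + copy.count); // prints "hadoop 3"
    }
}
```

The empty constructor matters in this pattern: the receiving side creates a blank object first and then fills it in with `readFields`.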
