Table Creation

CREATE TABLE HomeNeeds(Type STRING, Product STRING, No INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Insertion

LOAD DATA LOCAL INPATH '/home/turbo/workspace/Sample Datas/Test.csv'
OVERWRITE INTO TABLE HomeNeeds;

Create Table with Partition

CREATE TABLE HomeNeeds(Type String, Product String, No Int)
PARTITIONED BY (Date String, Country String)  
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','; 

The partition columns and the table columns are independent of one another; partition values are not stored in the data file but are supplied when the data is loaded.

Inserting into Partitioned Table

LOAD DATA LOCAL INPATH '/home/turbo/workspace/Sample Datas/Test.csv' 
INTO TABLE HomeNeeds
PARTITION (Date='2001-01-25', Country='India');

Partition and Bucketing

CREATE TABLE HomeNeeds(Type String, Item String, No Int)
PARTITIONED BY (Area String)
CLUSTERED BY (Type) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';





	
Writing a UDF Filter Function

package com.mugil.pig;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class FilterType extends FilterFunc {

	@Override
	public Boolean exec(Tuple tuple) throws IOException {

		if (tuple == null || tuple.size() == 0)
			return false;

		try {
			Object obj = tuple.get(0);

			if (obj == null)
				return false;

			String type = (String) obj;

			if (type.equals("Kitchen"))
				return true;

		} catch (Exception e) {
			throw new IOException("Caught exception processing input row " + e.getMessage(), e);
		}

		return false;
	}
}

Registering UDF Function

grunt> REGISTER  /usr/local/pig-0.15.0/FilterByType3.jar;                  
grunt> DEFINE FilterType com.mugil.pig.FilterType();         
grunt> filtered_records = FILTER records BY FilterType(Type);
grunt> DUMP filtered_records;
Posted in Pig.

Binary search is faster than linear search if the collection is sorted; if the collection contains duplicate values, it finds only one (arbitrary) occurrence.

public static void binarySearch(int searchVal)
{
  int lowerIndex = 0;
  int higherIndex = arrNumbers.length - 1;

  while(lowerIndex <= higherIndex)
  {
	int middleIndex = (lowerIndex + higherIndex)/2;

	if(searchVal < arrNumbers[middleIndex])
	{
		// Search value lies in the lower half
		higherIndex = middleIndex - 1;
	}
	else if(searchVal > arrNumbers[middleIndex])
	{
		// Search value lies in the upper half
		lowerIndex = middleIndex + 1;
	}
	else
	{
		System.out.println("The element is Found at Index " + middleIndex);
		return;
	}
  }

  System.out.println("The element is not present in the array");
}

Bubble Sort

public void bubbleSort()
{
	for (int i = arrNumbers.length-1; i > 0; i--)
	{
		// After each pass the largest remaining value bubbles up to index i
		for (int j = 0; j < i; j++)
		{
			if(arrNumbers[j] > arrNumbers[j+1])
			{
				swapValuesAtIndex(j, j+1);
			}
		}
	}
}

Selection Sort
Selection sort works by dividing the list into two parts, sorted and unsorted. On each pass it takes the first unsorted element as the minimum, compares this minimum with the remaining unsorted elements, and swaps the true minimum into place.

public void selectionSort()
{
	for (int i = 0; i < arrNumbers.length; i++)
	{
		int minElement = i;

		// Find the smallest element in the unsorted part of the array
		for (int j = i + 1; j < arrNumbers.length; j++)
		{
			if(arrNumbers[minElement] > arrNumbers[j])
			{
				minElement = j;
			}
		}

		swapValuesAtIndex(minElement, i);
	}
}

Insertion Sort
Insertion sort performs well compared to the other simple sorts, especially on small or nearly sorted lists. The list is divided into a sorted and an unsorted portion. Once a number is selected for comparison, the pass does not end until that number has been placed at its correct location.

public void insertionSort()
{	
	for (int i = 1; i < arrNumbers.length; i++) 
	{
		int j = i;
		int toCompare = arrNumbers[i];
		
		//holds no to Insert - arrNumbers[j-1]
		while((j>0) && (arrNumbers[j-1] > toCompare))
		{
			arrNumbers[j] = arrNumbers[j-1];
			j--;
		}
		
		arrNumbers[j] =  toCompare;		
	}
}
Linear search is preferable when searching for an element in a collection where the elements are duplicated and occur multiple times; binary search is efficient when the collection is sorted and its elements are unique.
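For contrast with the binary search above, a linear scan needs no sorting and can collect every occurrence of a value. A minimal sketch (class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class LinearSearchDemo {
    // Scans the whole array once and records every index holding the value.
    // Unlike binary search, this needs no sorting and finds all duplicates.
    public static List<Integer> indexOfAll(int[] values, int target) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] numbers = {7, 3, 7, 1, 7};
        System.out.println(indexOfAll(numbers, 7)); // prints [0, 2, 4]
    }
}
```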
Removing Elements from an Array

public static String[] removeElements(String[] input, String deleteMe)
{
    List<String> result = new LinkedList<>();

    for (String item : input)
        if (!deleteMe.equals(item))
            result.add(item);

    // Use a fresh array; reusing input would leave stale trailing elements
    return result.toArray(new String[0]);
}

Loading CSV File using Pig Script

A = LOAD '...' USING PigStorage(',') AS (...);

Filtering NULL values in chararray
Example 1: the literal string 'null' as chararray
input.txt

1,2014-04-08 12:09:23.0
2,2014-04-08 12:09:23.0
3,null
4,null

Pig:

A = LOAD 'input.txt' USING PigStorage(',') AS (f1:int,f2:chararray);
B = FILTER A BY f2!='null';
DUMP B;

Example 2: a real null value
input.txt

1,2014-04-08 12:09:23.0
2,2014-04-08 12:09:23.0
3,
4,

Pig:

A = LOAD 'input.txt' USING PigStorage(',') AS (f1:int,f2:chararray);
B = FILTER A BY f2 is not null;
DUMP B;

Output:

(1,2014-04-08 12:09:23.0)
(2,2014-04-08 12:09:23.0)

Finding Max from CSV File

Test.csv

Maruthi,10
Maruthi,55
Suziki,50
Honda,4
Maruthi,40
Suziki,60
Honda,14
BMW,140
Benz,5

a1 = LOAD 'Test.csv' USING PigStorage(',') AS (Car:chararray, No:int);
DESCRIBE a1;

Output

 a1: {Car: chararray,No: int}

b1 = GROUP a1 BY Car;
DESCRIBE b1;

 b1: {group: chararray,a1: {(Car: chararray,No: int)}}

DUMP b1;

Output

(BMW,{(BMW,140)})
(Benz,{(Benz,5)})
(Honda,{(Honda,4),(Honda,14)})
(Suziki,{(Suziki,50),(Suziki,60)})
(Maruthi,{(Maruthi,10),(Maruthi,55),(Maruthi,40)})
(,{(,)})

c1 = FOREACH b1 GENERATE group, MAX(a1.No);

Output

(BMW,140)
(Benz,5)
(Honda,14)
(Suziki,60)
(Maruthi,55)

Filtering Empty Records
Corrupted record CSV content

HouseHold,Soap,2
Kitchen,Oil,2
HouseHold,Sweeper,2
PoojaItems,Sandal
Kitchen,Rice,30
HouseHold,,1
Kitchen,Sugar,5
HouseHold,Shampoo,2
PoojaItems,Champor,10
HouseHold,Soap,2

filtered_records = FILTER records BY Item is null OR No is null;

Getting Count of Corrupted Records

records = LOAD 'Test.csv' USING PigStorage(',') AS (Type:chararray, Item:chararray, No:int);
filtered_records = FILTER records BY Item is null OR No is null;
grouped_records = GROUP filtered_records BY Type;
DESCRIBE grouped_records;
 grouped_records: {group: chararray,filtered_records: {(Type: chararray,Item: chararray,No: int)}}

corrupt_records = FOREACH grouped_records GENERATE group , COUNT(filtered_records);

(HouseHold,1)
(PoojaItems,1)

Writing macros to find Maximum Item Sold

DEFINE max_item_sold(Records, Type, No) RETURNS c 
{ 
 b = GROUP $Records BY $Type;                        
 $c = FOREACH b GENERATE group, MAX($Records.$No);  
}; 

max_type_sold = max_item_sold(records, Type, No);

Finding Friends via map reduce


Difference between Combiner and Reducer

  • One constraint that a Combiner has, unlike a Reducer, is that its input/output key and value types must match the output types of your Mapper.
  • Combiners can only be used for functions that are commutative (a.b = b.a) and associative ((a.b).c = a.(b.c)). This also means that a combiner may operate only on a subset of your keys and values, or may not execute at all; either way, the output of the program must remain the same.
  • Reducers can get data from multiple Mappers as part of the partitioning process; a Combiner gets its input from only one Mapper.
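The commutative/associative requirement can be illustrated without Hadoop: pre-aggregating per-mapper partial sums must give the same answer as reducing all the raw values at once. A small local simulation (the class and method names are illustrative, not Hadoop APIs):

```java
import java.util.Arrays;

public class CombinerSketch {
    // Simulates a combiner: each "mapper" pre-sums its own values.
    public static int combine(int[] mapperOutput) {
        return Arrays.stream(mapperOutput).sum();
    }

    // Simulates the reducer: sums whatever partial results it receives.
    public static int reduce(int... partials) {
        return Arrays.stream(partials).sum();
    }

    public static void main(String[] args) {
        int[] mapper1 = {1, 1, 1};
        int[] mapper2 = {1, 1};
        // Without a combiner the reducer sees all five 1s; with a combiner
        // it sees only the partial sums 3 and 2. Because addition is
        // commutative and associative, the result must be identical.
        int withoutCombiner = reduce(1, 1, 1, 1, 1);
        int withCombiner = reduce(combine(mapper1), combine(mapper2));
        System.out.println(withoutCombiner + " " + withCombiner); // prints "5 5"
    }
}
```

A non-associative operation (such as average) would break this equality, which is why it cannot be used directly as a combiner.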

Difference between HDFS block and input split

  • A block is the physical unit of disk storage: the minimum amount of data that can be read or written. The block size is decided during the design phase; for example, the HDFS block size can be set to 128 MB/256 MB, though the default HDFS block size is 64 MB.
  • HDFS blocks are physical entities, while an input split is a logical partition.
  • A logical partition holds only information about block addresses or locations. When the last record (value) in a block is incomplete, the input split includes the location of the next block and the byte offset of the data needed to complete the record.
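As a sketch of how splits relate to blocks: FileInputFormat picks a split size as splitSize = max(minSize, min(maxSize, blockSize)). The method below is an illustrative re-implementation of that formula, not the Hadoop API itself:

```java
public class SplitSizeSketch {
    // Mirrors the formula FileInputFormat uses to pick a split size:
    // the block size, clamped between the configured min and max.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // With default-style settings the split size equals the block size,
        // so input splits usually line up with HDFS blocks.
        System.out.println(computeSplitSize(64 * mb, 1, Long.MAX_VALUE) / mb);      // prints 64
        // Raising minSize forces splits larger than a block, so a single
        // split (and hence one mapper) logically spans multiple blocks.
        System.out.println(computeSplitSize(64 * mb, 128 * mb, Long.MAX_VALUE) / mb); // prints 128
    }
}
```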

Counters Example Link

User Defined Counters Example

What is GenericOptionsParser
GenericOptionsParser is a utility to parse command line arguments generic to the Hadoop framework. GenericOptionsParser recognizes several standard command line arguments, enabling applications to easily specify a namenode, a ResourceManager, additional configuration resources etc.

Using GenericOptionsParser lets you specify generic options on the command line itself.

E.g., with generic options you can specify the following:

>>hadoop jar /home/hduser/WordCount/wordcount.jar WordCount -Dmapred.reduce.tasks=20 input output

GenericOptionsParser vs ToolRunner

There are no extra privileges, but your command line options get run through the GenericOptionsParser, which lets you extract certain configuration properties and populate a Configuration object from them.

By using ToolRunner.run(), any Hadoop application can handle the standard command line options supported by Hadoop. ToolRunner uses GenericOptionsParser internally. In short, the Hadoop-specific options provided on the command line are parsed and set into the Configuration object of the application.

E.g., if you say:

>>hadoop MyHadoopApp -D mapred.reduce.tasks=3

Then ToolRunner.run(new MyHadoopApp(), args) will automatically set the value parameter mapred.reduce.tasks to 3 in the Configuration object.

Basically rather that parsing some options yourself (using the index of the argument in the list), you can explicitly configure Configuration properties from the command line:

hadoop jar myJar.jar com.Main prop1value prop2value
public static void main(String args[]) {
    Configuration conf = new Configuration();
    conf.set("prop1", args[0]);
    conf.set("prop2", args[1]);

    conf.get("prop1"); // will resolve to "prop1value"
    conf.get("prop2"); // will resolve to "prop2value"
}

Becomes much more condensed with ToolRunner:

hadoop jar myJar.jar com.Main -Dprop1=prop1value -Dprop2=prop2value

public int run(String args[]) {
    Configuration conf = getConf();

    conf.get("prop1"); // will resolve to "prop1value"
    conf.get("prop2"); // will resolve to "prop2value"

    return 0;
}

GenericOptionsParser, Tool, and ToolRunner for running Hadoop Job

Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired. You don’t usually use GenericOptionsParser directly, as it’s more convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally:

public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}

Below example shows a very simple implementation of Tool, for running the Hadoop Map Reduce Job.

public class WordCountConfigured extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // Print every property so the effect of -conf/-D options is visible
        for (Map.Entry<String, String> entry : conf) {
            System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
        }

        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCountConfigured(), args);
        System.exit(exitCode);
    }
}

We make WordCountConfigured a subclass of Configured, which is an implementation of the Configurable interface. All implementations of Tool need to implement Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this. The run() method obtains the Configuration using Configurable’s getConf() method, and then iterates over it, printing each property to standard output.

WordCountConfigured’s main() method does not invoke its own run() method directly. Instead, we call ToolRunner’s static run() method, which takes care of creating a Configuration object for the Tool, before calling its run() method. ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line, and set them on the Configuration instance. We can see the effect of picking up the properties specified in conf/hadoop-localhost.xml by running the following command:

    >>hadoop WordCountConfigured -conf conf/hadoop-localhost.xml -D mapred.job.tracker=localhost:10011 -D mapred.reduce.tasks=n

Options specified with -D take priority over properties from the configuration files. This is very useful: you can put defaults into configuration files, and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n, which overrides the number of reducers set on the cluster or in any client-side configuration files. The other options that GenericOptionsParser and ToolRunner support are listed in the table.
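That precedence rule can be sketched locally with plain maps standing in for Hadoop's Configuration (the class and method names below are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class PrecedenceSketch {
    // Later sources override earlier ones: file defaults first, -D options last.
    public static Map<String, String> resolve(Map<String, String> fileDefaults,
                                              Map<String, String> dashDOptions) {
        Map<String, String> effective = new HashMap<>(fileDefaults);
        effective.putAll(dashDOptions); // -D wins over configuration files
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("mapred.reduce.tasks", "1");   // from a config file
        Map<String, String> dashD = new HashMap<>();
        dashD.put("mapred.reduce.tasks", "20");     // from -D on the command line
        System.out.println(resolve(defaults, dashD).get("mapred.reduce.tasks")); // prints 20
    }
}
```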