Reading Constants from a Properties File
constant.properties

PointA.PointX=-20
PointA.PointY=0

spring.xml

<beans>
	<bean id="pointA" class="com.mugil.shapes.Point">
		<property name="x" value="${PointA.PointX}"/>
		<property name="y" value="${PointA.PointY}"/>
	</bean>
	<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
		<property name="location" value="constant.properties"></property>
	</bean>
</beans>
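
A quick way to verify the placeholder resolution (a sketch; PlaceholderDemo is a hypothetical class name, and it assumes the Point class with int x/y properties shown later in this post):

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class PlaceholderDemo {
	public static void main(String[] args) {
		// constant.properties must be on the classpath for the configurer to find it
		ApplicationContext objContext = new ClassPathXmlApplicationContext("spring.xml");
		Point pointA = (Point) objContext.getBean("pointA");
		// prints "-20 0", resolved from constant.properties
		System.out.println(pointA.getX() + " " + pointA.getY());
	}
}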

Using an Interface
spring.xml

<beans>
	<bean id="pointA" class="com.mugil.shapes.Point">
		<property name="x" value="${PointA.PointX}"/>
		<property name="y" value="${PointA.PointY}"/>
	</bean>
	<bean class="com.mugil.shapes.Circle" id="circleId">
		<property name="center" ref="pointA"/>
	</bean>
</beans>

Shape.java

public interface Shape { 
	public void drawShape();
}

Triangle.java

public class Triangle implements Shape 
{
	@Override
	public void drawShape() 
	{
		System.out.println("Shape of Triangle");
	}
}

Circle.java

public class Circle implements Shape{
	private Point center;
	
	@Override
	public void drawShape() {
		System.out.println("Shape of Circle ");
		System.out.println("Center of Cirlce "+ getCenter().getX() + " - " + getCenter().getY());
	}

	public Point getCenter() {
		return center;
	}

	public void setCenter(Point center) {
		this.center = center;
	}		
}

The objShape reference calls the drawShape() method of whichever implementation it points to at runtime.
DrawingApp.java

public class DrawingApp {
	public static void main(String[] args) {
		ApplicationContext objContext = new ClassPathXmlApplicationContext("spring.xml");
		Shape objShape = (Shape) objContext.getBean("circleId");
		objShape.drawShape();
	}
}
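
Because DrawingApp depends only on the Shape interface, switching the drawn shape is purely a configuration change. As a sketch, if spring.xml also declared a Triangle bean (hypothetical id triangleBeanId), the same call site would print the triangle message with no change to DrawingApp:

// hypothetical: <bean id="triangleBeanId" class="com.mugil.shapes.Triangle"/>
Shape objShape = (Shape) objContext.getBean("triangleBeanId");
objShape.drawShape(); // prints "Shape of Triangle"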

ApplicationContextAware and BeanNameAware

  1. The Aware interfaces have the feel of the listener, callback, or observer design patterns.
  2. Aware is the super interface of the two. The XxxAware interfaces are a common pattern used within the Spring framework.
  3. They are typically used to hand a Spring-managed bean an object (via the interface's setXxx method) at Spring bootstrap time. During bootstrapping, Spring examines each bean to determine whether it implements any of the XxxAware interfaces. When one is found, it invokes the interface method, providing the piece of information that is being asked for.

In Spring, the bean lifecycle from creation to destruction is managed by the Spring container. There may be scenarios where we want to access a bean created by the Spring container from a non-Spring-managed class. The beans created by the Spring container are available in the ApplicationContext, and whenever there are any changes to a bean they are reflected in the ApplicationContext.
By implementing ApplicationContextAware in the bean that should be accessed from outside, setApplicationContext is called when new ClassPathXmlApplicationContext(...) is created from the outside class, giving us access to the beans via the context.

SpringBeans.xml

<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
	<bean id="helloBean" class="com.mkyong.core.HelloWorld">
		<property name="name" value="Mugil" />
	</bean>
</beans>

HelloWorld.java – Spring Bean

import org.springframework.beans.BeansException;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;

public class HelloWorld implements ApplicationContextAware
{
	private String name;
	private ApplicationContext objContext = null;

	public void setName(String name) {
		this.name = name;
	}

	public void printHello() {
		System.out.println("Spring 3 : Hello ! " + name);
	}

	@Override
	public void setApplicationContext(ApplicationContext arg0) throws BeansException {
		this.objContext = arg0;
		System.out.println("Called when Object to new ClassPathXmlApplicationContext('springbeans.xml') is created");
	}
}

App.java

public class App 
{
 public static void main(String[] args) 
 {
  ApplicationContext context = new ClassPathXmlApplicationContext("SpringBeans.xml");

  HelloWorld obj = (HelloWorld) context.getBean("helloBean");
  obj.printHello();
 }
}

Output

Called when Object to new ClassPathXmlApplicationContext('springbeans.xml') is created
Spring 3 : Hello ! Mugil

When Spring instantiates beans, it looks for a couple of interfaces like ApplicationContextAware and InitializingBean. If they are found, the corresponding methods are invoked. Conceptually it works something like this (a simplified sketch, not the actual Spring source):

Class<?> beanClass = Class.forName(beanDefinition.getBeanClassName());
Object bean = beanClass.newInstance();
if (bean instanceof ApplicationContextAware) 
{
    ((ApplicationContextAware) bean).setApplicationContext(ctx);
}

In newer versions it may be better to use annotations rather than implementing Spring-specific interfaces:

@Inject // or @Autowired
private ApplicationContext ctx;
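
A minimal sketch of the annotation-driven equivalent of the HelloWorld bean above (assuming component scanning is enabled; HelloBean is a hypothetical name):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.stereotype.Component;

@Component
public class HelloBean {
	// injected by the container; no Spring-specific interface required
	@Autowired
	private ApplicationContext ctx;

	public int beanCount() {
		return ctx.getBeanDefinitionCount();
	}
}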

Internally this is done by a BeanPostProcessor executing its postProcessBeforeInitialization method; for example, the ApplicationContextAwareProcessor mentioned above invokes the Aware callbacks:

private void invokeAwareInterfaces(Object bean) {
        if (bean instanceof Aware) {
            if (bean instanceof EnvironmentAware) {
                ((EnvironmentAware) bean).setEnvironment(this.applicationContext.getEnvironment());
            }
            if (bean instanceof EmbeddedValueResolverAware) {
                ((EmbeddedValueResolverAware) bean).setEmbeddedValueResolver(
                        new EmbeddedValueResolver(this.applicationContext.getBeanFactory()));
            }
            if (bean instanceof ResourceLoaderAware) {
                ((ResourceLoaderAware) bean).setResourceLoader(this.applicationContext);
            }
            if (bean instanceof ApplicationEventPublisherAware) {
                ((ApplicationEventPublisherAware) bean).setApplicationEventPublisher(this.applicationContext);
            }
            if (bean instanceof MessageSourceAware) {
                ((MessageSourceAware) bean).setMessageSource(this.applicationContext);
            }
            if (bean instanceof ApplicationContextAware) {
                ((ApplicationContextAware) bean).setApplicationContext(this.applicationContext);
            }
        }
    }

When it is Invoked
Shape.java

// With an ApplicationContext, beans are created eagerly at startup,
// so the System.out in setApplicationContext prints here
ApplicationContext objContext = new ClassPathXmlApplicationContext("spring1.xml");

// With a plain BeanFactory, beans are created lazily,
// so the callbacks fire only when getBean() is called
BeanFactory objBeanFactory = new XmlBeanFactory(new FileSystemResource("spring.xml"));
Triangle objTriangle2 = (Triangle) objBeanFactory.getBean("triangleName");

BeanFactoryAware
BeanFactoryAware gives access to the BeanFactory, whereas ApplicationContextAware gives access to the ApplicationContext. Note that the ApplicationContext interface is a subclass of BeanFactory and provides additional methods on top of the basic BeanFactory interface.
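
A minimal BeanFactoryAware sketch, analogous to the ApplicationContextAware example above (MyBeanFactoryAwareBean is a hypothetical name):

import org.springframework.beans.BeansException;
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.BeanFactoryAware;

public class MyBeanFactoryAwareBean implements BeanFactoryAware {
	private BeanFactory beanFactory;

	@Override
	public void setBeanFactory(BeanFactory beanFactory) throws BeansException {
		// called by the container once the bean's properties have been set
		this.beanFactory = beanFactory;
	}
}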

BeanNameAware Interface
A bean implementing this interface can obtain its name as defined in the Spring container.
One possible area of use is if you are building on or extending the Spring framework and would like to acquire bean names for logging, wiring, etc.

MyBeanName.java

import org.springframework.beans.factory.BeanNameAware;

public class MyBeanName implements BeanNameAware 
{
	@Override
	public void setBeanName(String beanName) 
	{
		System.out.println(beanName);
	}
}

Config.java

@Configuration
public class Config 
{ 
    @Bean(name = "myCustomBeanName")
    public MyBeanName getMyBeanName() {
        return new MyBeanName();
    }
}
Output

myCustomBeanName

  1. The beanName property represents the bean id registered in the Spring container. When a bean is given a name in the Spring container and you want to access that name, BeanNameAware should be used.
  2. In the above example, running the code outputs myCustomBeanName, which is the name given to the bean by the container at runtime (see the bootstrap sketch after this list).
  3. If no name is given, it prints the method name, getMyBeanName, as the bean name.
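
A sketch of how the Config class above might be bootstrapped; setBeanName fires while the context starts up and prints the name (BeanNameDemo is a hypothetical class name):

import org.springframework.context.annotation.AnnotationConfigApplicationContext;

public class BeanNameDemo {
	public static void main(String[] args) {
		// setBeanName("myCustomBeanName") is invoked during startup
		AnnotationConfigApplicationContext ctx =
				new AnnotationConfigApplicationContext(Config.class);
		ctx.close();
	}
}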

Do you require access to the additional features available on an ApplicationContext? If so, then you should of course use ApplicationContextAware. If not, BeanFactoryAware will be sufficient. Amongst many other things, an ApplicationContext has additional methods for inspecting the beans, e.g. containsBeanDefinition, getBeanDefinitionCount, getBeanDefinitionNames, getBeanNamesForType, and getBeansOfType, that may be useful to you and which are not available on BeanFactory.

We should avoid using any of the Aware interfaces unless we need them, as implementing these interfaces couples the code to the Spring framework.

————————————————————————————————————————————————————

Inheriting Bean Definition
spring.xml

<beans>
	<bean id="parentTriangle" class="com.mugil.shapes.Triangle">
		<property name="pointA">
			<ref bean="pointA"/>
		</property>
	</bean>
	<bean id="triangleId" class="com.mugil.shapes.Triangle" parent="parentTriangle">
		<property name="pointB" ref="pointB"/>
		<property name="pointC" ref="pointC"/>
	</bean>
</beans>

Above, the bean parentTriangle is inherited by the child bean triangleId.
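
As a quick check (a sketch, assuming the pointA bean and the Triangle getters shown elsewhere in this post), the child bean gets pointA from the parent definition:

Triangle objTriangle = (Triangle) objContext.getBean("triangleId");
// pointA comes from the parent definition, pointB/pointC from the child
System.out.println(objTriangle.getPointA().getX()); // -20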

spring.xml
A bean definition can also be made abstract by using abstract="true", like the one below; an abstract definition only serves as a parent and is never instantiated itself.

<beans>
	<bean id="parentTriangle" class="com.mugil.shapes.Triangle" abstract="true">
		<property name="pointA">
			<ref bean="pointA"/>
		</property>
	</bean>
</beans>

Managing Lifecycle of Bean

  1. Note that the context objContext is referred to via AbstractApplicationContext, not ApplicationContext
  2. registerShutdownHook() registers a hook which gets called at the end of the application for cleanup

Shape.java

 AbstractApplicationContext objContext = new ClassPathXmlApplicationContext("spring1.xml");
 objContext.registerShutdownHook();

Triangle.java

public class Triangle implements InitializingBean, DisposableBean 
{
 @Override
 public void destroy() throws Exception {
   System.out.println("DisposableBean Called");
 }

 @Override
 public void afterPropertiesSet() throws Exception {
   System.out.println("InitializingBean Called");
 }
}

We can also declare the methods that should be called for initialization and destruction in spring.xml, as below
spring.xml

<bean id="triangleId" class="com.mugil.shapes.Triangle" init-method="myInit" destroy-method="myDestroy">
                <property name="pointA" ref="pointA"/>
		<property name="pointB" ref="pointB"/>
		<property name="pointC" ref="pointC"/>
	</bean>

Triangle.java

public class Triangle implements InitializingBean, DisposableBean 
{
  public void myInit()
  {
    System.out.println("Custom Init Method");
  }
	
  public void myDestroy()
  {
    System.out.println("Custom Destroy Method");
  }
}

Priority of method calls when both the XML configuration and the interface implementation are used (the combined sketch after this list shows both styles in one class):

  1. afterPropertiesSet() of InitializingBean is called first
  2. Then the custom init-method declared in the XML is called
  3. On container shutdown, destroy() of DisposableBean is called first
  4. Then the custom destroy-method declared in the XML is called

Bean Post Processor

  1. Runs before and after the initialization of every bean
  2. Is applied only when the beans are created through an ApplicationContext; a plain BeanFactory does not pick up BeanPostProcessors automatically
  3. Is called for every bean the container initializes, parent and child beans alike

spring.xml

<beans>
	<bean id="triangleId" class="com.mugil.shapes.Triangle">
	    <property name="pointA" ref="pointA"/>
		<property name="pointB" ref="pointB"/>
		<property name="pointC" ref="pointC"/>
	</bean>
	
	<bean id="pointA" class="com.mugil.shapes.Point">
		<property name="x" value="-20"/>
		<property name="y" value="0"/>
	</bean>
	<bean id="pointB" class="com.mugil.shapes.Point">
		<property name="x" value="0"/>
		<property name="y" value="0"/>
	</bean>
	<bean id="pointC" class="com.mugil.shapes.Point">
		<property name="x" value="20"/>
		<property name="y" value="0"/>
	</bean>
	
	<bean class="com.mugil.shapes.BeanInitialization"/>
	
</beans>

BeanInitialization.java

import org.springframework.beans.BeansException;
import org.springframework.beans.factory.config.BeanPostProcessor;

public class BeanInitialization implements BeanPostProcessor
{
	@Override
	public Object postProcessAfterInitialization(Object obj, String objName) throws BeansException 
	{
		System.out.println("After Initialization of " + objName);
		return obj;
	}

	@Override
	public Object postProcessBeforeInitialization(Object obj, String objName) throws BeansException 
	{
		System.out.println("Before Initialization of " + objName);
		return obj;
	}
}

So the above code runs four times, once each for the initialization of pointA, pointB, pointC, and triangleId.

Shape.java

public class Shape 
{ 
  public static void main(String[] args)  
  {
    ApplicationContext objContext = new ClassPathXmlApplicationContext("spring1.xml");
    Triangle objTriangle2 =  (Triangle)objContext.getBean("triangleId");
    . 
    .
    .
  }
}

A BeanFactoryPostProcessor runs before the beans get instantiated in the bean factory; it operates on the bean definitions (the configuration metadata) rather than on bean instances.

BeanInitialization2.java

import org.springframework.beans.BeansException;
import org.springframework.beans.factory.config.BeanFactoryPostProcessor;
import org.springframework.beans.factory.config.ConfigurableListableBeanFactory;

public class BeanInitialization2 implements BeanFactoryPostProcessor
{
  @Override
  public void postProcessBeanFactory(ConfigurableListableBeanFactory arg0) throws BeansException 
  {
     System.out.println("This is Bean factory Post Processor");
  }
}

spring.xml

<beans>
  <bean class="com.mugil.shapes.BeanInitialization2"/>	
</beans>
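
Because it runs before any bean is instantiated, a BeanFactoryPostProcessor can still change the configuration metadata. A sketch that overrides a property of the pointA bean defined earlier (PointDefinitionTweaker is a hypothetical class name):

import org.springframework.beans.BeansException;
import org.springframework.beans.factory.config.BeanDefinition;
import org.springframework.beans.factory.config.BeanFactoryPostProcessor;
import org.springframework.beans.factory.config.ConfigurableListableBeanFactory;

public class PointDefinitionTweaker implements BeanFactoryPostProcessor {
	@Override
	public void postProcessBeanFactory(ConfigurableListableBeanFactory factory) throws BeansException {
		// no Point instance exists yet; we are editing the bean definition
		BeanDefinition def = factory.getBeanDefinition("pointA");
		def.getPropertyValues().add("x", "100"); // overrides the value from spring.xml
	}
}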

By using a Partitioner we can group the output based on a specific column. The column on which the output should be grouped is used for the partition. In the case below I have used the second value of the TextPair key for grouping.

The reducer a record goes to is its hash modulo the number of reducers.

The custom partitioner below takes the hash code of the key's second value, clears the sign bit by AND-ing with Integer.MAX_VALUE, and takes the remainder modulo the total number of reducers:

partitionValue = (key.getSecond().hashCode() & Integer.MAX_VALUE) % noOfReducers;

package com.mugil.part;

import org.apache.hadoop.mapreduce.Partitioner;

import com.mugil.avg.LongPair;
import com.mugil.avg.TextPair;

public class FirstPartioner extends Partitioner<TextPair, LongPair>
{

   @Override
   public int getPartition(TextPair arg0, LongPair arg1, int noOfReducers) 
   {
	int partitionValue = 0 ;		
	partitionValue = (arg0.getSecond().hashCode() & Integer.MAX_VALUE)%noOfReducers;		
	return partitionValue;
   } 
}
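
For the partitioner to take effect it has to be registered in the driver (a sketch; objJob is the Job instance used elsewhere in these examples, and the reducer count is arbitrary):

// route each key to a reducer based on the second value of the TextPair
objJob.setPartitionerClass(FirstPartioner.class);
objJob.setNumReduceTasks(4); // hypothetical reducer count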

spring.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
    <bean id="triangleId" class="com.mugil.shapes.Triangle"/>
</beans>

Triangle.java

package com.mugil.shapes;

public class Triangle {
	private String type;
	
	public String getType() {
		return type;
	}

	public void setType(String type) {
		this.type = type;
	}

	public void drawShape()
	{
		System.out.println("Shape of Triangle");
	}
}

Reading a Bean via ApplicationContext and BeanFactory

public class Shape 
{
   public static void main(String[] args) 
   {
	ApplicationContext objContext = new ClassPathXmlApplicationContext("spring1.xml");
	Triangle objTriangle1 =  (Triangle)objContext.getBean("triangleId");
	objTriangle1.drawShape();
		
		
	BeanFactory  objBeanFactory = new XmlBeanFactory(new FileSystemResource("spring.xml"));
	Triangle objTriangle2 =  (Triangle)objBeanFactory.getBean("triangleId");
	objTriangle2.drawShape();
    }
}

Constructor Initialization
Triangle.java

package com.mugil.shapes;

public class Triangle 
{
	private String type;

	public Triangle(String ptype)
	{
		this.type = ptype;
	}

	public void drawShape()
	{
		System.out.println("Shape of Triangle");
	}
}

The index specifies which constructor argument of the bean is initialized:

<beans>
	<bean id="triangleId" class="com.mugil.shapes.Triangle">
		<constructor-arg index="0" value="Isolseles"/>
	</bean>
</beans>

Real World Dependency Injection
spring.xml

 
<beans>
	<bean id="triangleId" class="com.mugil.shapes.Triangle">
		<property name="pointA" ref="point1"/>
		<property name="pointB" ref="point2"/>
		<property name="pointC" ref="point3"/>
	</bean>
	
	<bean id="point1" class="com.mugil.shapes.Point">
		<property name="x" value="-20"/>
		<property name="y" value="0"/>
	</bean>
	<bean id="point2" class="com.mugil.shapes.Point">
		<property name="x" value="0"/>
		<property name="y" value="0"/>
	</bean>
	<bean id="point3" class="com.mugil.shapes.Point">
		<property name="x" value="20"/>
		<property name="y" value="0"/>
	</bean>
	
</beans>

Triangle.java

 
public class Triangle {
	private String type;
	private Point pointA;
	private Point pointB;
	private Point pointC;
	
	public String getType() {
		return type;
	}

	public void setType(String type) {
		this.type = type;
	}

	public void drawShape()
	{
		System.out.println("Shape of Triangle");
	}

	public Point getPointA() {
		return pointA;
	}

	public void setPointA(Point pointA) {
		this.pointA = pointA;
	}

	public Point getPointB() {
		return pointB;
	}

	public void setPointB(Point pointB) {
		this.pointB = pointB;
	}

	public Point getPointC() {
		return pointC;
	}

	public void setPointC(Point pointC) {
		this.pointC = pointC;
	}
}

Point.java

 
package com.mugil.shapes;

public class Point {
	private int x;
	private int y;
	
	public int getX() {
		return x;
	}
	public void setX(int x) {
		this.x = x;
	}
	public int getY() {
		return y;
	}
	public void setY(int y) {
		this.y = y;
	}
}

Shape.java

 
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.xml.XmlBeanFactory;
import org.springframework.core.io.FileSystemResource;

public class Shape 
{
    public static void main(String[] args) 
    {		
	BeanFactory  objBeanFactory = new XmlBeanFactory(new FileSystemResource("spring.xml"));
	Triangle objTriangle2 =  (Triangle)objBeanFactory.getBean("triangleId");
		
	System.out.println("The Refereeed Points are = ");
	System.out.println("Point A :" + objTriangle2.getPointA().getX() +" " + objTriangle2.getPointA().getY());
	System.out.println("Point B :" + objTriangle2.getPointB().getX() +" " + objTriangle2.getPointB().getY());
	System.out.println("Point C :" + objTriangle2.getPointC().getX() +" " + objTriangle2.getPointC().getY());		
	}
}

In case the bean won't be referred to elsewhere, you can define the bean property inline, as below

<beans>
	<bean id="triangleId" class="com.mugil.shapes.Triangle">
		<property name="pointA">
			<bean id="point1" class="com.mugil.shapes.Point">
				<property name="x" value="-20"/>
				<property name="y" value="0"/>
			</bean>			
		</property>
		<property name="pointB" ref="point2"/>
		<property name="pointC" ref="point3"/>
	</bean>
</beans>

instead of

<beans>
	<bean id="triangleId" class="com.mugil.shapes.Triangle">
		<property name="pointA" ref="point1"/>
		<property name="pointB" ref="point2"/>
		<property name="pointC" ref="point3"/>
	</bean>
	
	<bean id="point1" class="com.mugil.shapes.Point">
		<property name="x" value="-20"/>
		<property name="y" value="0"/>
	</bean>
.
.
.

Using Alias

<bean id="triangleId" class="com.mugil.shapes.Triangle" name="triangleName">
.
.
.
</bean>
<alias name="triangleId" alias="triangle-alias"/>

In Java we can refer to the bean either by its name or by its alias, as below
Using Alias

.
.
Triangle objTriangle2 =  (Triangle)objBeanFactory.getBean("triangleName");
(or)
Triangle objTriangle2 =  (Triangle)objBeanFactory.getBean("triangle-alias");
.
.

Using List
Triangle.java

public class Triangle 
{
    private List<Point> points;	
	
    public List<Point> getPoints() 
    {
	return points;
    }

    public void setPoints(List<Point> points) 
    {
 	this.points = points;
    }			
}

spring.xml

<beans>
	<bean id="triangleId" class="com.mugil.shapes.Triangle" name="triangleName">
		<property name="points">
			<list>
				<ref bean="point1"/>
				<ref bean="point2"/>
				<ref bean="point3"/>
			</list>		
		</property>		
	</bean>
	<bean id="point1" class="com.mugil.shapes.Point">
			<property name="x" value="-20"/>
			<property name="y" value="0"/>
	</bean>
	<bean id="point2" class="com.mugil.shapes.Point">
		<property name="x" value="0"/>
		<property name="y" value="0"/>
	</bean>
	<bean id="point3" class="com.mugil.shapes.Point">
		<property name="x" value="20"/>
		<property name="y" value="0"/>
	</bean>
</beans>

Shape.java

List<Point> arrPoints = objTriangle2.getPoints();
		
 for (Point objPoint : arrPoints) 
 {
   System.out.println("Point :" + objPoint.getX() +" " + objPoint.getY());
 }

Autowiring
Autowiring can be done byName (as below), byType, or by constructor.

<bean id="triangleId" class="com.mugil.shapes.Triangle" name="triangleName" autowire="byName">	
</bean>
<bean id="pointA" class="com.mugil.shapes.Point">
  <property name="x" value="-20"/>
  <property name="y" value="0"/>
</bean>
<bean id="pointB" class="com.mugil.shapes.Point">
  <property name="x" value="0"/>
  <property name="y" value="0"/>
</bean>
<bean id="pointC" class="com.mugil.shapes.Point">
  <property name="x" value="20"/>
  <property name="y" value="0"/>
</bean>

When autowiring byName, the name of the instance variable in the class should match the bean name in the XML.
Triangle.java

 
public class Triangle {	
	private Point pointA;
	private Point pointB;
	private Point pointC;
.        
.
}

A binary file is a file whose content must be interpreted by a program or a hardware processor that understands in advance exactly how it is formatted. That is, the file is not in any externally identifiable format such that any program that wanted to could look for certain data at a certain place within the file. A program (or hardware processor) has to know exactly how the data inside the file is laid out to make use of the file.

Hadoop does not work very well with a lot of small files, i.e. files smaller than a typical HDFS block size, as holding huge numbers of small files causes a memory overhead for the NameNode. Also, every map task processes a block of data at a time, and when a map task has too little data to process it becomes inefficient. Starting up several such map tasks is an overhead.

To solve this problem, Sequence files are used as a container to store the small files. Sequence files are flat files containing key/value pairs. A very common use case when designing ingestion systems is to use Sequence files as containers and store any file-related metadata (filename, path, creation time, etc.) as the key and the file contents as the value.

A Sequence file can have three different formats: an Uncompressed format, a Record-Compressed format where the value is compressed, and a Block-Compressed format where entire records are compressed. There are sync markers every few hundred bytes (approximately) that represent record boundaries.
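
A minimal sketch of using a Sequence file as a small-file container (the path and file names are hypothetical; this uses the classic SequenceFile.createWriter(fs, conf, ...) signature):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);

		// key = file metadata (here just the name), value = file contents
		SequenceFile.Writer writer = SequenceFile.createWriter(
				fs, conf, new Path("/demo/smallfiles.seq"),
				Text.class, BytesWritable.class);
		try {
			byte[] contents = "file body".getBytes();
			writer.append(new Text("file1.txt"), new BytesWritable(contents));
		} finally {
			writer.close();
		}
	}
}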


  1. As binary files, these are more compact than text files
  2. Provide optional support for compression at different levels – record, block.
  3. Files can be split and processed in parallel
  4. As HDFS and MapReduce are optimized for large files, Sequence files can be used as containers for large numbers of small files, thus solving Hadoop’s drawback in processing huge numbers of small files.
  5. Extensively used in MapReduce jobs as input and output formats. Internally, the temporary outputs of maps are also stored using the Sequence file format.

A Sequence file consists of a header followed by one or more records. All three formats use the same header structure.

  1. Uncompressed format
  2. Record Compressed format
  3. Block-Compressed format

Header Structure of Sequence Files

Record Structure of Sequence Files

Block Structure of Sequence Files


e.g. Assume that you are uploading images to Facebook and you have to remove duplicate images. You can’t store an image in text format. What you can do: get the MD5SUM of the image file, and if the MD5SUM already exists in the system, just discard the insertion of the duplicate image. In your text file, you can simply have “Date:” and “Number of images uploaded”. The image can be stored outside of HDFS, like on a CDN network or some other web server.


Listing Files in a Directory

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapReduceDriver extends Configured implements Tool
{
   public static void main(String[] args) throws Exception 
   {
	MapReduceDriver objMapReduceDriver = new MapReduceDriver();

	Configuration conf = new Configuration();

	FileSystem fs = FileSystem.get(conf);
	Path path = new Path(args[0]);

	FileStatus[] status = fs.listStatus(path);
	Path[] paths = FileUtil.stat2Paths(status);

	for (Path path2 : paths) 
	{
	  System.out.println(path2.toString());
	}

	int res = ToolRunner.run(objMapReduceDriver, args);
	System.exit(res);
   }
}

The same listing can be turned into a comma-separated string of input paths for a job (note that assigning inside the loop would overwrite the value each time; the paths have to be accumulated):

Path path = new Path(args[0]);
FileStatus[] status = fs.listStatus(path);
Path[] paths = FileUtil.stat2Paths(status);

StringBuilder csvPaths = new StringBuilder();
for (Path path2 : paths) 
{
  if (csvPaths.length() > 0)
    csvPaths.append(",");
  csvPaths.append(path2.toString());
}

FileInputFormat.setInputPaths(objJob, csvPaths.toString());

Merging Files in a Folder
copyMerge – Parameters

  1. Source FileSystem object
  2. Input path (the directory to merge)
  3. Destination FileSystem object
  4. Output path (the merged file)
  5. Delete original files (boolean)
  6. Configuration object
  7. addString – a String appended after each merged file (null here)

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

Path inputPath = new Path(args[0]);
Path outPath = new Path(args[2]);
		
boolean Merge = FileUtil.copyMerge(fs, inputPath, fs, outPath, false, conf, null);
		
if(Merge)
  System.out.println("Merge Successful");
		

globStatus takes patterns

Path path = new Path(args[0] + "/Inputs/Input*");
FileStatus[] status = fs.globStatus(path);

Merging Multiple Paths

 import org.apache.commons.lang.StringUtils;

 String csvPaths = StringUtils.join(paths, ",");

 // setInputPaths accepts a comma-separated list of paths directly;
 // calling it repeatedly in a loop would overwrite the earlier paths
 FileInputFormat.setInputPaths(objJob, csvPaths);

Passing Arguments on the Command Line and Fetching Them

String filterWords =  context.getConfiguration().get("Word.Name");
				
for (int i = 0; i < arrString.length; i++) 
{	
  if(filterWords.equals(arrString[i].toString()))
    context.write(new Text(arrString[i].toString()), new IntWritable(1));
}

Input

 -DWord.Name=Tests /home/turbo/workspace/MapReduce5/src/Inputs/Inputs[1-2] /home/turbo/workspace/MapReduce5/src/Outputs/

Word.Name – the parameter passed on the command line. Generic options such as -D must always be passed before the other arguments.

ToolRunner’s GenericOptionsParser consumes the -D parameter before run() is invoked, so args.length is 3 in main() but 2 in run() (see the driver sketch below).
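
A sketch of the driver skeleton this implies (WordFilterDriver is a hypothetical name; job setup is omitted):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordFilterDriver extends Configured implements Tool {
	public static void main(String[] args) throws Exception {
		// args.length == 3 here: -DWord.Name=Tests, the input path, the output path
		System.exit(ToolRunner.run(new WordFilterDriver(), args));
	}

	@Override
	public int run(String[] args) throws Exception {
		// args.length == 2 here: GenericOptionsParser moved -DWord.Name into the Configuration
		String filterWord = getConf().get("Word.Name");
		System.out.println("Filtering on: " + filterWord);
		return 0; // job configuration and submission omitted
	}
}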

In MapReduce v1, mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum are used to configure the number of map slots and reduce slots accordingly in mapred-site.xml.

Starting from MapReduce v2 (YARN), container is the more generic term used instead of slot; containers represent the maximum number of tasks that can run in parallel on a node, regardless of whether they are map tasks, reduce tasks, or the application master (in YARN).

Suppose you have a TaskTracker with 32 GB of memory, 16 map slots, and 8 reduce slots. If all task JVMs use 1 GB of memory and all slots are filled, you have 24 Java processes with 1 GB each, for a total of 24 GB. Because you have 32 GB of physical memory, there is probably enough memory for all 24 processes. On the other hand, if your average map and reduce tasks need 2 GB of memory and all slots are full, the 24 tasks could need up to 48 GB of memory, more than is available. To avoid over-committing TaskTracker node memory, reduce the number of slots.

Question: I have a 10-node cluster and each node has a quad-core processor, so in total there are 80 slots (8×10). It is said that one map task is allotted per map slot. Suppose 60 slots are usable (say the other slots are busy handling other daemons). If I have a file that has 1000 blocks, and say 40 slots are for map tasks and 20 for reduce tasks, how are the blocks going to be processed? The number of map tasks is based on the number of blocks available (1000 here), but the available map slots are only 40.

Answer: The number of map tasks for a given job is driven by the number of input splits, not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a MapReduce job, the number of map tasks is equal to the number of input splits. mapred.map.tasks is just a hint to the InputFormat for the number of maps.

Assume your Hadoop input file size is 2 GB and you set the block size to 64 MB: 32 mapper tasks are set to run, and each mapper will process one 64 MB block to complete the mapper portion of your Hadoop job.

==> The number of mappers set to run depends entirely on 1) the file size and 2) the block size

Assume you are running Hadoop on a cluster of size 4, and assume you set the mapred.map.tasks and mapred.reduce.tasks parameters in your conf file for the nodes as follows:

Node 1: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 2: mapred.map.tasks = 2 and mapred.reduce.tasks = 2
Node 3: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 4: mapred.map.tasks = 1 and mapred.reduce.tasks = 1

Assume you set the above parameters for 4 of your nodes in this cluster. If you notice, Node 2 has set only 2 and 2 respectively because the processing resources of Node 2 might be less (e.g. 2 processors, 2 cores), and Node 4 is set even lower to just 1 and 1 respectively, perhaps because the processing resources on that node are 1 processor with 2 cores, so it can’t run more than 1 mapper and 1 reducer task.

So when you run the job, Node 1, Node 2, Node 3, and Node 4 are configured to run a maximum total of (4+2+4+1) = 11 mapper tasks simultaneously, out of the 42 mapper tasks that need to be completed by the job. After each node completes its map tasks, it takes up the remaining mapper tasks left out of the 42.

A processor is the entire chipset including all the cores. Cores are like 2 (or more, as in 4-core or 6-core) parts of the processor that do parallel processing (processing two different pieces of data simultaneously in different units), which helps in multitasking without causing much strain on the processor. Each core itself is technically a processor. But the chipset is manufactured in such a way that the different cores work in coordination, not individually. An analogy is dividing a large hall into several “identical” bedrooms so that there is no overcrowding. Each bedroom is like a core that does the same function of housing the guests but is physically separate.

Mapper
The map function in the Mapper reads the input file row by row.

Combiner
The Combiner won’t be called when the Reducer class is not set in the Driver class.

Reducer
The Reducer and Combiner need not do the same thing, as in the case of computing the average of the numbers 0 to 100 (averaging is not associative, so the combiner cannot simply reuse the reducer logic).

Input to Mapper

1/1/09 1:26,Product2,1200,Nikki,United States
1/1/09 1:51,Product2,1200,Si,Denmark
1/1/09 10:06,Product2,3600,Irene,Germany
1/1/09 11:05,Product2,1200,Janis,Ireland
1/1/09 12:19,Product2,1200,Marlene,United States
1/1/09 12:20,Product2,3600,seemab,Malta
1/1/09 12:25,Product2,3600,Anne-line,Switzerland
1/1/09 12:42,Product1,1200,ashton,United Kingdom
1/1/09 14:19,Product2,1200,Gabriel,Canada
1/1/09 14:22,Product1,1200,Liban,Norway
1/1/09 16:00,Product2,1200,Toni,United Kingdom
1/1/09 16:44,Product2,1200,Julie,United States
1/1/09 18:32,Product1,1200,Andrea,United States

Output of Mapper

Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {3600} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {3600} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {3600} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product1}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product1}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product2}
Value = Product Price {1200} 	  Product No {1}
----------------------------
Key = Product Date {2009-01} Product Name {Product1}
Value = Product Price {1200} 	  Product No {1}
----------------------------

The magic of the framework happens here: the shuffle/sort groups the mapper output values by key
Input of Combiner

----------------------------
Key = Product Date {2009-01} 	 Product Name {Product1}
Values
Product Price 1200	
Product No 1
Product Price 1200	
Product No 1
Product Price 1200	
Product No 1
----------------------------
Key = Product Date {2009-01} 	 Product Name {Product2}
Values
Product Price 1200	
Product No 1
Product Price 1200	
Product No 1
Product Price 3600	
Product No 1
Product Price 1200	
Product No 1
Product Price 1200	
Product No 1
Product Price 3600	
Product No 1
Product Price 1200	
Product No 1
Product Price 1200	
Product No 1
Product Price 1200	
Product No 1
Product Price 3600	
Product No 1
----------------------------

Values are added together in the Combiner based on the key

key 2009-01	Product1
productPrice 1200
productNos 1
----------------------------
productPrice 2400
productNos 2
----------------------------
productPrice 3600
productNos 3
----------------------------
key 2009-01	Product2
productPrice 1200
productNos 1
----------------------------
productPrice 2400
productNos 2
----------------------------
productPrice 6000
productNos 3
----------------------------
productPrice 7200
productNos 4
----------------------------
productPrice 8400
productNos 5
----------------------------
productPrice 12000
productNos 6
----------------------------
productPrice 13200
productNos 7
----------------------------
productPrice 14400
productNos 8
----------------------------
productPrice 15600
productNos 9
----------------------------
productPrice 19200
productNos 10
----------------------------

Output of Combiner and Input to Reducer

Key = Product Date {2009-01} 	 Product Name {Product1}
Value = Product Price {3600}	Product Nos {3}
----------------------------
Key = Product Date {2009-01} 	 Product Name {Product2}
Value = Product Price {19200} Product Nos {10}

Output of Reducer

Key = Product Date {2009-01} 	 Product Name {Product1}
Value = AvgVolume {1200}	NoOfRecords {3}
----------------------------
Key = Product Date {2009-01} 	 Product Name {Product2}
Value = AvgVolume {1920}	NoOfRecords {10}
----------------------------

compareTo is used for object comparison

The compareTo logic defines how to sort the dataset and also tells the reducer which elements are equal so they can be grouped.

Compares this object with the specified object for order. Returns a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object.

Let’s say we would like to compare Jedis by their age:

class Jedi implements Comparable<Jedi> {

    private final String name;
    private final int age;
        //...
}

Then, if our Jedi is older than the provided one, you must return a positive integer; if they are the same age, you return 0; and if our Jedi is younger, you return a negative integer.

public int compareTo(Jedi jedi){
    return this.age > jedi.age ? 1 : this.age < jedi.age ? -1 : 0;
}
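
Since Java 7 the same comparison can be written more concisely with Integer.compare, which also avoids the overflow pitfalls of subtraction-based comparators:

public int compareTo(Jedi jedi){
    return Integer.compare(this.age, jedi.age);
}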

By implementing the compareTo method (coming from the Comparable interface) you are defining what is called a natural order. All sorting methods in the JDK will use this ordering by default.

There are occasions in which you may want to base your comparison on other objects, and not on a primitive type. For instance, compare Jedis based on their names. In this case, if the objects being compared already implement Comparable, then you can do the comparison using their compareTo method.

public int compareTo(Jedi jedi){
    return this.name.compareTo(jedi.getName());
}

It would be simpler in this case.

Now, if you intend to use both name and age as the comparison criteria, then you have to decide your order of comparison, i.e. what has precedence. For instance, if two Jedis are named the same, then you can use their age to decide which goes first and which goes second.

public int compareTo(Jedi jedi){
    int result = this.name.compareTo(jedi.getName());
    if(result == 0){
        result = this.age > jedi.age ? 1 : this.age < jedi.age ? -1 : 0;
    }
    return result;
}

If you had an array of Jedis

Jedi[] jediAcademy = {new Jedi("Obiwan",80), new Jedi("Anakin", 30), ..}

All you have to do is ask the class java.util.Arrays to use its sort method.

Arrays.sort(jediAcademy);

This Arrays.sort method will use your compareTo method to sort the objects one by one.

  1. Yes, a combiner can be different from the Reducer, although your Combiner will still implement the Reducer interface. Combiners can only be used in specific cases, which are going to be job dependent. The Combiner operates like a Reducer, but only on the subset of key/values output from each Mapper. One constraint your Combiner has, unlike a Reducer, is that the input/output key and value types must match the output types of your Mapper.
  2. The primary goal of combiners is to optimize/minimize the number of key/value pairs that will be shuffled across the network between mappers and reducers, and thus to save as much bandwidth as possible.
  3. The rule of thumb for a combiner is that it has to have the same input and output variable types. The reason for this is that combiner use is not guaranteed; it may or may not run, depending on the volume and number of spills.
  4. The Reducer can be used as a combiner when it satisfies this rule, i.e. same input and output variable types.
  5. The other most important rule for a combiner is that it can only be used when the function you want to apply is both commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), like adding numbers, but not in a case like average (if you are using the same code as the reducer). Wiring a combiner in the driver is shown in the sketch after this list.
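
Wiring a combiner in the driver is a single call (a sketch; the class names are hypothetical, and the combiner's input/output types must match the mapper's output types):

// the combiner runs on each mapper's local output before the shuffle
objJob.setMapperClass(SalesMapper.class);
objJob.setCombinerClass(SalesCombiner.class); // often the Reducer class itself, when types allow
objJob.setReducerClass(SalesReducer.class);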