Loading CSV File using Pig Script
A = LOAD '...' USING PigStorage(',') AS (...);
Filtering NULL values in chararray
Example1: null as chararray
input.txt
1,2014-04-08 12:09:23.0
2,2014-04-08 12:09:23.0
3,null
4,null
Pig:
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:int,f2:chararray);
B = FILTER A BY f2!='null';
DUMP B;
Example2: Real null value
input.txt
1,2014-04-08 12:09:23.0
2,2014-04-08 12:09:23.0
3,
4,
Pig:
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:int,f2:chararray);
B = FILTER A BY f2 is not null;
DUMP B;
Output:
(1,2014-04-08 12:09:23.0)
(2,2014-04-08 12:09:23.0)
Finding Max from CSV File
Test.csv
Maruthi,10
Maruthi,55
Suziki,50
Honda,4
Maruthi,40
Suziki,60
Honda,14
BMW,140
Benz,5
a1 = LOAD 'Test.csv' USING PigStorage(',') AS (Car:chararray, No:int);
DESCRIBE a1;
Output
a1: {Car: chararray,No: int}
b1 = GROUP a1 BY Car;
DESCRIBE b1;
b1: {group: chararray,a1: {(Car: chararray,No: int)}}
DUMP b1;
Output
(BMW,{(BMW,140)})
(Benz,{(Benz,5)})
(Honda,{(Honda,4),(Honda,14)})
(Suziki,{(Suziki,50),(Suziki,60)})
(Maruthi,{(Maruthi,10),(Maruthi,55),(Maruthi,40)})
(,{(,)})
c1 = FOREACH b1 GENERATE group, MAX(a1.No);
Output
(BMW,140)
(Benz,5)
(Honda,14)
(Suziki,60)
(Maruthi,55)
Filtering Empty Records
Corrpted Record tsv content
HouseHold,Soap,2
Kitchen,Oil,2
HouseHold,Sweeper,2
PoojaItems,Sandal
Kitchen,Rice,30
HouseHold,,1
Kitchen,Sugar,5
HouseHold,Shampoo,2
PoojaItems,Champor,10
HouseHold,Soap,2
filtered_records = FILTER records BY Item is null OR No is null;
Getting Count of Corrupted Records
records = LOAD 'Test.csv' USING PigStorage(',') AS (Type:chararray, Item:chararray, No:int);
filtered_records = FILTER records BY Item is null OR No is null;
grouped_records = GROUP filtered_records BY Type;
DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {(Type: chararray,Item: chararray,No: int)}}
corrupt_records = FOREACH grouped_records GENERATE group , COUNT(filtered_records);
(HouseHold,1)
(PoojaItems,1)
Writing macros to find Maximum Item Sold
DEFINE max_item_sold(Records, Type, No) RETURNS c
{
b = GROUP $Records BY $Type;
$c = FOREACH b GENERATE group, MAX($Records.$No);
};
max_type_sold = max_item_sold(records, Type, No);