I want to load a Billionaires JSON dataset into Pig. The JSON file can be found here.
Here is what each entry has:
{
    "wealth": {
        "worth in billions": 1.2,
        "how": {
            "category": "Resource Related",
            "from emerging": true,
            "industry": "Mining and metals",
            "was political": false,
            "inherited": true,
            "was founder": true
        },
        "type": "privatized and resources"
    },
    "company": {
        "sector": "aluminum",
        "founded": 1993,
        "type": "privatization",
        "name": "Guangdong Dongyangguang Aluminum",
        "relationship": "owner"
    },
    "rank": 1372,
    "location": {
        "gdp": 0.0,
        "region": "East Asia",
        "citizenship": "China",
        "country code": "CHN"
    },
    "year": 2014,
    "demographics": {
        "gender": "male",
        "age": 50
    },
    "name": "Zhang Zhongneng"
}
Attempt 1
I tried loading this data with the following command in the Grunt shell:
billionaires = LOAD 'billionaires.json' USING JsonLoader('wealth: (worth in billions:double, how: (category:chararray, from emerging:chararray, industry:chararray, was political:chararray, inherited:chararray, was founder:chararray), type:chararray), company: (sector:chararray,founded:int,type:chararray,name:chararray,relationship:chararray),rank:int,location:(gdp:double,region:chararray,citizenship:chararray,country code:chararray), year:int, demographics: (gender:chararray,age:int), name:chararray');
This however gives me the error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'in' expecting RIGHT_PAREN
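The parser seems to stop at the space in worth in billions. As far as I know, Pig field aliases must be plain identifiers (a letter followed by letters, digits, or underscores), so keys containing spaces cannot be written in the schema string as they are. Below is a minimal Python sketch of a preprocessing step that renames such keys before the file is loaded. It assumes renaming the keys is acceptable; sanitize_keys is a hypothetical helper name, and the sample entry is a shortened version of the record above.

```python
import json


def sanitize_keys(value):
    """Recursively replace spaces in dictionary keys with underscores."""
    if isinstance(value, dict):
        return {
            key.replace(" ", "_"): sanitize_keys(inner)
            for key, inner in value.items()
        }
    if isinstance(value, list):
        return [sanitize_keys(item) for item in value]
    # Scalars (numbers, strings, booleans, None) are returned unchanged.
    return value


# Shortened sample entry, taken from the record shown in the question.
entry = json.loads(
    '{"wealth": {"worth in billions": 1.2, "how": {"was founder": true}},'
    ' "location": {"country code": "CHN"}, "name": "Zhang Zhongneng"}'
)
cleaned = sanitize_keys(entry)
print(json.dumps(cleaned))
```

With keys like worth_in_billions and country_code, the schema string would at least contain only valid aliases. Whether the built-in JsonLoader then accepts the nested schema is something I have not verified.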
Attempt 2
Next I tried the loader from Twitter's Elephant Bird project, com.twitter.elephantbird.pig.load.JsonLoader. Here is the code for this UDF. This is what I did:
billionaires = LOAD 'billionaires.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
names = foreach billionaires generate json#'name' AS name;
dump names;
Now this runs and I get no errors! But nothing gets displayed. I get an output like:
Input(s):
Successfully read 0 records (1445335 bytes) from: "hdfs://localhost:9000/user/purak/billionaires.json"

Output(s):
Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp-1399280624/tmp-477607570"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1478889184960_0005
What am I doing wrong here?
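One possibility worth ruling out, stated as an assumption rather than a confirmed diagnosis: as far as I can tell, the Elephant Bird JsonLoader treats each input line as a separate JSON object, so a file that stores all entries as one top-level JSON array (or pretty-printed across many lines) would produce zero parsable records. The sketch below converts such a file into one object per line. The function name to_json_lines and the two-record sample are hypothetical; the real file's layout has not been checked here.

```python
import json


def to_json_lines(text):
    """Convert a JSON document (array or single object) to one object per line."""
    data = json.loads(text)
    # Assumption: a top-level array holds the records; a lone object is one record.
    records = data if isinstance(data, list) else [data]
    # json.dumps without indent emits no raw newlines, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical two-record sample in the assumed array layout.
sample = '[\n  {"name": "A", "rank": 1},\n  {"name": "B", "rank": 2}\n]'
lines = to_json_lines(sample).splitlines()
for line in lines:
    print(line)
```

If the converted file loads with a non-zero record count, the array layout was the problem; if it still reads 0 records, the cause lies elsewhere.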