2

I'm trying to parse a rather simple json file using Pig and the Twitter's elephant-bird library, but it turns into a very painfull debugging process.

The json has the following structure:

oid_id: (oid:chararray), 
bookmarks: {(
  oid_id:(oid:chararray),
  id:chararray,
  creator: chararray,
  position:chararray,
  creationdate:($ate:chararray)
  )},
lastaction:(date:chararray),
settings:(preferredlanguage:chararray),
userid:chararray

An example of row:

{"oid_id":{"oid":"573239f905474a686e2333f0"},"bookmarks":[{"id":"LEGONINX106W0079264","creator":"player","position":96,"creationdate":{"date":"2016-12-26T09:37:36.916Z"},"oid_id":{"oid":"5860e4e0ca6baf9032edc0d0"}},{"id":"ONEPERCENTMW0128677","creator":"player","position":0.08,"creationdate":{"date":"2018-12-18T15:42:33.956Z"},"oid_id":{"oid":"5c191569faf8474953758930"}}],"lastaction":{"date":"2018-12-18T15:42:28.107Z"},"settings":{"preferredlanguage":"vf","preferredvideoquality":"hd"},"userid":"ocs_32a6ad6dd242d5e3842f9211fd236723_1461773211"}

Here is my code (inspired by this tutorial: https://acadgild.com/blog/determining-popular-hashtags-in-twitter-using-pig)

register /path/to/json-simple-1.1.1.jar 
register /path/to/elephant-bird-core-4.17.jar
register /path/to/elephant-bird-pig-4.17.jar
register /path/to/elephant-bird-hadoop-compat-4.17.jar
define JsonLoaderEB com.twitter.elephantbird.pig.load.JsonLoader;

A = LOAD 'file.json' USING JsonLoaderEB('-nestedLoad=true') as myMap;
describe A;

input_table: { myMap: bytearray }

B = foreach A generate flatten(myMap#'bookmarks') as (bookmark:map[]);
describe B;

B: { bookmark: map[] }

When we dump the above relation, we can see that all the data has been loaded successfully.

([{"oid_id":{"oid":"5860e4e0ca6baf9032edc0d0"},"creator":"player","creationdate":{"date":"2016-12-26T09:37:36.916Z"},"id":"LEGONINX106W0079264","position":96},{"oid_id":{"oid":"5c191569faf8474953758930"},"creator":"player","creationdate":{"date":"2018-12-18T15:42:33.956Z"},"id":"ONEPERCENTMW0128677","position":0.08}])

Now we extract creationdate, creator, id and position from bookmark.

C = foreach B generate bookmark#'creationdate' as date_fact, bookmark#'creator' as creator, bookmark#'id' as id, bookmark#'position' as position;

C: { date_fact: bytearray, creator: bytearray, id: bytearray, position: bytearray }

Dumping the table gives the following error:

Pig Stack Trace

ERROR 1066: Unable to open iterator for alias C. Backend error : Vertex failed, vertexName=scope-41, vertexId=vertex_1542613138136_6721 88_2_00, diagnostics=[Task failed, taskId=task_1542613138136_672188_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1542613138136_672188_2_00_000000_0:org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: C: Store(hdfs://sandbox/tmp/temp-1543074195/tmp277240455:org.apache.pig.impl.io.InterStorage) - sc ope-40 Operator Key: scope-40): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POMapLookUp ( Name: POMapLookUp[bytearray] - scope-28 Operator Key: scope-28) children: null at [null[4,31]]]: java.lang.ClassCastException: java.lan g.String cannot be cast to java.util.Map at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:315) at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123) at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:376) at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:241) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POMapLookUp (Name: POMapLookUp[byt earray] - scope-28 Operator Key: scope-28) children: null at [null[4,31]]]: java.lang.ClassCastException: java.lang.String cannot be ca st to java.util.Map at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:364) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:406) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:323) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305) 1,9Top

dams
  • 309
  • 1
  • 4
  • 13

1 Answers1

0

Even if it gives a good result for table_extraction relation, it can be from the raw data.

Can u please remove or correct the following object, it looks invalid :

"oid":"5c191393faf8475cb76ee0d5"
54l3d
  • 3,913
  • 4
  • 32
  • 58