Using PySpark 1.6 on the Cloudera Hadoop distribution, I'm trying to read, parse, and load a given JSON file from an HDFS location, but I keep running into brick walls. Here's what the JSON strings look like after doing a "-cat" on the file in Hadoop:
{"json_data":"{\"table\":\"TEST.FUBAR\",\"op_type\":\"I\",\"op_ts\":\"2019-03-14 15:33:50.031848\",\"current_ts\":\"2019-03-14T15:33:57.479002\",\"pos\":\"1111\",\"after\":{\"COL1\":949494949494949494,\"COL2\":99,\"COL3\":2,\"COL4\":\" 99999\",\"COL5\":9999999,\"COL6\":90,\"COL7\":42478,\"COL8\":\"I\",\"COL9\":null,\"COL10\":\"2019-03-14 15:33:49\",\"COL11\":null,\"COL12\":null,\"COL13\":null,\"COL14\":\"x222263 \",\"COL15\":\"2019-03-14 15:33:49\",\"COL16\":\"x222263 \",\"COL17\":\"2019-03-14 15:33:49\",\"COL18\":\"2020-09-10 00:00:00\",\"COL19\":\"A\",\"COL20\":\"A\",\"COL21\":0,\"COL22\":null,\"COL23\":\"2019-03-14 15:33:47\",\"COL24\":2,\"COL25\":2,\"COL26\":\"R\",\"COL27\":\"2019-03-14 15:33:49\",\"COL28\":\" \",\"COL29\":\"PBU67H \",\"COL30\":\" 20000\",\"COL31\":2,\"COL32\":null}}"}
I then try to create a DataFrame from the JSON file using:
df = sqlContext.read.json("test_data.json")
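If it helps, the inferred schema is just a single string column, so none of the inner fields are exposed:

df.printSchema()
# root
#  |-- json_data: string (nullable = true)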
I then run my PySpark script and insert the DataFrame into a temporary Hive table in Parquet format, which looks like this when I select from it:
+------------------+
| json_data|
+------------------+
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
+------------------+
only showing top 20 rows
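For reference, the write step looks roughly like this (the database/table names are placeholders, not my real ones):

# Sketch of the write step only; tmp_db.json_staging is a made-up name.
df.write.format("parquet").mode("overwrite").saveAsTable("tmp_db.json_staging")
sqlContext.sql("SELECT json_data FROM tmp_db.json_staging").show()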
I need all the JSON attributes and values displayed in a row-and-column format, but everything comes out as one long string. Since my work goes through a proxy with strict firewall settings, I'm currently unable to use the 'Hive-JSON-Serde' jar to parse it as JSON (which would solve a bunch of my problems). I tried posexplode and LATERAL VIEW explode, but those only work on array/map columns, not on a plain string column like this one (roughly what I tried is sketched below).
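Roughly what I attempted (again, the table name is made up):

# Fails because explode()/posexplode() expect an array or map column,
# and json_data is just a plain string here.
sqlContext.sql("""
    SELECT x.col
    FROM tmp_db.json_staging t
    LATERAL VIEW explode(t.json_data) x AS col
""").show()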
Without installing any third-party software, changing the Spark SQL execution engine, or touching any other admin settings on the cluster (since I'm a normal user-loser), is there any workaround on Cloudera 5.x to parse this string as JSON, or to insert the values into denormalized tables with proper rows and columns, using PySpark?
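For example, would something along these lines be a sane direction entirely within PySpark 1.6 and no extra jars? This is only a rough sketch reusing the df from above; the column names come from my sample record:

# Pull the inner JSON string out of json_data and let Spark infer its schema,
# which avoids any external SerDe.
inner_rdd = df.select("json_data").rdd.map(lambda row: row.json_data)
parsed = sqlContext.read.json(inner_rdd)
parsed.printSchema()

# The top-level fields and the nested "after" struct could then be pulled
# out into ordinary columns:
flat = parsed.select("op_type", "op_ts", "pos", "after.COL1", "after.COL2")
flat.show()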