
Using PySpark 1.6 on the Cloudera Hadoop distribution, I'm trying to read, parse, and load a given JSON file from an HDFS location, but I keep running into brick walls. Here's what the JSON strings look like after doing a "hadoop fs -cat" on the file:

{"json_data":"{\"table\":\"TEST.FUBAR\",\"op_type\":\"I\",\"op_ts\":\"2019-03-14 15:33:50.031848\",\"current_ts\":\"2019-03-14T15:33:57.479002\",\"pos\":\"1111\",\"after\":{\"COL1\":949494949494949494,\"COL2\":99,\"COL3\":2,\"COL4\":\"            99999\",\"COL5\":9999999,\"COL6\":90,\"COL7\":42478,\"COL8\":\"I\",\"COL9\":null,\"COL10\":\"2019-03-14 15:33:49\",\"COL11\":null,\"COL12\":null,\"COL13\":null,\"COL14\":\"x222263 \",\"COL15\":\"2019-03-14 15:33:49\",\"COL16\":\"x222263 \",\"COL17\":\"2019-03-14 15:33:49\",\"COL18\":\"2020-09-10 00:00:00\",\"COL19\":\"A\",\"COL20\":\"A\",\"COL21\":0,\"COL22\":null,\"COL23\":\"2019-03-14 15:33:47\",\"COL24\":2,\"COL25\":2,\"COL26\":\"R\",\"COL27\":\"2019-03-14 15:33:49\",\"COL28\":\"  \",\"COL29\":\"PBU67H   \",\"COL30\":\"            20000\",\"COL31\":2,\"COL32\":null}}"}

I then try to create a dataframe from the JSON file using:

df = sqlContext.read.json("test_data.json")

I then run my pyspark script and insert the dataframe into a temporary Hive table in parquet format (a rough sketch of that write step follows the output below). Selecting from that table shows:

+------------------+
|         json_data|
+------------------+
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
|{"table":"TEST....|
+------------------+
only showing top 20 rows
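
For reference, the write step looks roughly like this (the table name here is hypothetical, not the one from my actual script):

    df.write.format("parquet").mode("overwrite").saveAsTable("tmp_json_data")
    sqlContext.sql("SELECT json_data FROM tmp_json_data").show()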

I need all the JSON attributes and values to display in a row-and-column format, but it's displaying as one long string. Since my work is done through a proxy with rigorous firewall settings, I'm currently unable to use the 'Hive-JSON-Serde' jar to parse it as JSON (which would really solve a bunch of my problems). I tried doing a posexplode or a lateral view explode, but those are for struct/array types, not an ArrayBuffer type.

Without installing any third-party software, changing the Spark SQL execution engine, or touching any other admin settings on the cluster (since I'm a normal user-loser), is there any workaround on Cloudera 5.x to parse this string as JSON, or to insert the values into de-normalized tables with rows, columns, etc., using pyspark?


1 Answer


The "json_data" content is actually a string and not json which has built in schema structures like arrays, maps, and structs. My problem is with the extraneous double quotes (") around the actual guts of the "json_data" that was causing a problem when Spark was trying to read it. Sample:

{"json_data":"{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":"            99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":"  ","COL29":"PBU67H   ","COL30":"            20000","COL31":2,"COL32":null}}"}

Here is the same sample after I removed the double quotes:

{"json_data":{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":"            99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":"  ","COL29":"PBU67H   ","COL30":"            20000","COL31":2,"COL32":null}}}

I'll probably have to use a regex or some kind of utility function to remove the double quotes around the data. A rough sketch of one way to do it in PySpark (the HDFS path here is hypothetical) is to parse each line twice with Python's json module and re-emit the record without the wrapping quotes:
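
    import json

    raw = sc.textFile("/user/test_data.json")  # hypothetical path

    def unwrap(line):
        outer = json.loads(line)                 # handles the \" escapes
        inner = json.loads(outer["json_data"])   # inner string -> dict
        return json.dumps({"json_data": inner})  # re-wrap without the quotes

    df = sqlContext.read.json(raw.map(unwrap))   # read.json accepts an RDD of strings

After modifying the data this way and running pyspark, I got this: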

    Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkContext available as sc, HiveContext available as sqlContext.
>>> filePath = "/user/no_quote_json.json"
>>> df = sqlContext.read.json(filePath)
>>> df.printSchema()
root
 |-- json_data: struct (nullable = true)
 |    |-- after: struct (nullable = true)
 |    |    |-- COL1: long (nullable = true)
 |    |    |-- COL10: string (nullable = true)
 |    |    |-- COL11: string (nullable = true)
 |    |    |-- COL12: string (nullable = true)
 |    |    |-- COL13: string (nullable = true)
 |    |    |-- COL14: string (nullable = true)
 |    |    |-- COL15: string (nullable = true)
 |    |    |-- COL16: string (nullable = true)
 |    |    |-- COL17: string (nullable = true)
 |    |    |-- COL18: string (nullable = true)
 |    |    |-- COL19: string (nullable = true)
 |    |    |-- COL2: long (nullable = true)
 |    |    |-- COL20: string (nullable = true)
 |    |    |-- COL21: long (nullable = true)
 |    |    |-- COL22: string (nullable = true)
 |    |    |-- COL23: string (nullable = true)
 |    |    |-- COL24: long (nullable = true)
 |    |    |-- COL25: long (nullable = true)
 |    |    |-- COL26: string (nullable = true)
 |    |    |-- COL27: string (nullable = true)
 |    |    |-- COL28: string (nullable = true)
 |    |    |-- COL29: string (nullable = true)
 |    |    |-- COL3: long (nullable = true)
 |    |    |-- COL30: string (nullable = true)
 |    |    |-- COL31: long (nullable = true)
 |    |    |-- COL32: string (nullable = true)
 |    |    |-- COL4: string (nullable = true)
 |    |    |-- COL5: long (nullable = true)
 |    |    |-- COL6: long (nullable = true)
 |    |    |-- COL7: long (nullable = true)
 |    |    |-- COL8: string (nullable = true)
 |    |    |-- COL9: string (nullable = true)
 |    |-- current_ts: string (nullable = true)
 |    |-- op_ts: string (nullable = true)
 |    |-- op_type: string (nullable = true)
 |    |-- pos: string (nullable = true)
 |    |-- table: string (nullable = true)

>>> df.select("json_data.after.col29").show()
+---------+
|    col29|
+---------+
|PBU67H   |
+---------+

Once it's in this format, all the other code I've done to flatten out the struct objects and insert them into a flattened Hive table will work as-is. Roughly, the flattening step builds the select list from the inferred schema (a sketch, not my exact code):
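
    from pyspark.sql.functions import col

    # find the nested "after" struct in the inferred schema
    json_field = [f for f in df.schema.fields if f.name == "json_data"][0]
    after_field = [f for f in json_field.dataType.fields if f.name == "after"][0]

    # promote every COL1..COL32 field to a top-level column
    flat_cols = [col("json_data.after." + f.name).alias(f.name)
                 for f in after_field.dataType.fields]
    flat_df = df.select(flat_cols)

Hopefully this helps anyone who runs into a similar problem in the future.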
