PROBLEM DESCRIPTION
Let's say I've got a PySpark DataFrame, and one of its columns is a messy JSON string (not in standard form: single level, but wrapped in double quotes, with doubled inner quotes and keys that start with #):
+---------+--------------------+--------------------+
|DeskIndex| MetaData| ContentMediaType|
+---------+--------------------+--------------------+
| False|"{""#stateName"":...| text/plain|
| False|"{""#stateName"":...| text/plain|
| False|"{""channel"":""C...| text/plain|
| False|"{""channel"":""C...| text/plain|
| False|"{""#stateName"":...| text/plain|
| False|"{""#stateName"":...| text/plain|
| False|"{""#stateName"":...|application/qnd.l...|
| False|"{""#stateName"":...| text/plain|
| False|"{""#stateName"":...| text/plain|
| False|"{""#blip.payload...| text/plain|
| False|"{""#stateName"":...| text/plain|
| False|"{""channel"":""C...| text/plain|
| False|"{""#stateName"":...| text/plain|
| False|"{""#blip.payload...| text/plain|
| False|"{""#blip.payload...| text/plain|
| False|"{""#stateName"":...|application/qnd.l...|
| False|"{""#stateName"":...|application/qnd.l...|
| False|"{""channel"":""C...| text/plain|
| False|"{""#stateName"":...|application/qnd.l...|
| False|"{""#stateName"":...|application/qnd.l...|
+---------+--------------------+--------------------+
Example of a row in the MetaData column:
"{""#stateName"":""[K.9.2.3]bar"",""#stateId"":""thebar"",""#messageId"":""data:43ff-89e6-62aa235df798"",""#previousStateId"":""80b7-4a5978a03a47"",""#previousStateName"":""[K.9.2.2]foo "",""trace-id"":""84499a247d3%3A1"",""#uniqueId"":""42-9625-58da4a2cbd1f"",""date_created"":""2227"",""#tunnel.originalFrom"":""host-7"",""#tunnel.originalTo"":""1b16@mail.net"",""$originator"":""provider@mail.net"",""$claims"":""Node=provider@mail.net/i-hosted7Identity=provider@mail.netDomainRole=MemberScheme=Transport"",""$internalId"":""58da4a2cbd1f""}"
Is there a way to break this JSON data into multiple columns?
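To show what I mean by "not in standard form": a plain JSON parser rejects the raw value, but it parses fine once the outer quotes are stripped and the doubled quotes collapsed. A minimal check in plain Python, using a hypothetical shortened version of the row above:

```python
import json

# hypothetical shortened MetaData value, quoted the way it appears above
raw = '"{""#stateName"":""[K.9.2.3]bar"",""#stateId"":""thebar""}"'

try:
    json.loads(raw)
except json.JSONDecodeError as e:
    print("raw value is not valid JSON:", e)

# strip the outer quotes and collapse the CSV-style doubled quotes
cleaned = raw[1:-1].replace('""', '"')
record = json.loads(cleaned)
print(record["#stateName"])  # [K.9.2.3]bar
```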
WHAT I'VE TRIED SO FAR
1 - Most solutions I've found require me to declare a schema. But since I don't know the order of the keys in each JSON, how should I work that out?
2 - Following the solution here, PySpark: Read nested JSON from a String Type Column and create columns, with the code below I was able to separate the fields, but all columns returned null:
from pyspark.sql import functions as F

df0 = df1.select(
    F.get_json_object(F.col('MetaData'), "$.#stateName").alias("State Name"),
    F.get_json_object(F.col('MetaData'), "$.#stateId").alias("State ID"),
    F.get_json_object(F.col('MetaData'), "$.#messageId").alias("Message Id"),
    F.get_json_object(F.col('MetaData'), "$.#previousStateId").alias("previousStateId"),
    F.get_json_object(F.col('MetaData'), "$.#previousStateName").alias("previousStateName"),
)
df0.show()
+----------+--------+----------+---------------+-----------------+
|State Name|State ID|Message Id|previousStateId|previousStateName|
+----------+--------+----------+---------------+-----------------+
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
+----------+--------+----------+---------------+-----------------+