
I am attempting to read a MongoDB dump into a dataframe. The dump is in JSON format, except for the Date elements. Here is a sample piece of the JSON:

{
 "_id": {
  "$binary": "AAAB92tW4kSWbIyLJj/zWg==",
  "$type": "03"
 },
 "_t": "VisitData",
 "ContactId": {
  "$binary": "qc4p+OQsEUumAtDWxvSZuA==",
  "$type": "03"
 },
 "StartDateTime": Date(1541452223793),
 "EndDateTime": Date(1541452682373),
 "SaveDateTime": Date(1541453891548),
 "ChannelId": {
...

I'd like to get the date into a valid format so that I can re-read it into a dataframe correctly.

I tried reading the file in as one large string, but that failed miserably; I think the file is too large. I also tried reading it in as a CSV, which does work insofar as it creates a dataframe, but the columns are all over the place and I'm not sure how to get from there to valid JSON. Plus, it just seems like the wrong way to go about it.

Essentially, I'm not sure how to go about "pre-processing" the file in pyspark. Suggestions on the right way to do this are much needed.

FAA

1 Answer


My recommendation would be to fix that malformed Date part into proper JSON format with a cleaner script in Python, then use spark.read.json(path) to read in the fixed JSON file (if you intend to use PySpark).

I don't know how big that data dump is, but for the cleaning you probably want to do something like this: https://stackoverflow.com/a/18515887/11388628

You could use readline() to read in your malformed JSON one line at a time:

# open the raw dump and pull the next line for cleaning
infile = open("path\\filename.json", "r")
line = infile.readline()
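
For instance, the cleaning step could look something like this (a minimal sketch; the file paths and the regular expression are my assumptions, based on the Date(1541452223793) values in your sample). It streams the dump line by line, so the whole file never has to fit in memory, and replaces each Date(...) wrapper with the bare millisecond value so the result is valid JSON:

import re

date_pattern = re.compile(r"Date\((\d+)\)")  # matches e.g. Date(1541452223793)

with open("path\\filename.json", "r") as src, open("path\\filename_clean.json", "w") as dst:
    for line in src:  # stream one line at a time instead of loading the whole dump
        dst.write(date_pattern.sub(r"\1", line))  # keep only the millisecond value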

Clean as required, save the new JSON, then read it into PySpark with spark.read.json(path).
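
Reading the cleaned file back would then be roughly the following (again a sketch, not tested against your data; spark.read.json expects one document per line by default, so pass multiLine=True if your dump stays pretty-printed like the sample):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("path\\filename_clean.json")  # add multiLine=True for pretty-printed dumps

# The cleaned date fields are epoch milliseconds; cast them to proper timestamps
df = df.withColumn("StartDateTime", (F.col("StartDateTime") / 1000).cast("timestamp"))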

maverick