
I have a Kinesis Firehose delivery stream that puts data into S3. However, in the data file the JSON objects have no separator between them, so it looks something like this:

{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}

In Apache Spark I am doing this to read the data file:

df = spark.read.schema(schema).json(path, multiLine=True)

This reads only the first JSON object in the file; the rest are ignored because there is no separator.

How can I resolve this issue in Spark?

Uwe Keim
sjishan
  • Fix the upstream process? Anything you'll do in Spark will be at least somewhat inefficient and ugly. – zero323 Jan 12 '18 at 01:58
  • Makes sense, but I would like to know the RDD-based approach to solve this, or if there is any better approach of course. – sjishan Jan 12 '18 at 01:59
  • 2
    Off the top of my head: you can use `wholTextFiles` and parse manually - but it is bad performance wise. You can try to use [Hadoop Input format with delimiter](https://stackoverflow.com/q/31227363/6910411) if structure is always delimited by `}{`, and then fix records, but it is hack. You can implement your own input format, but not in Python, and it is a lot of code for such a problem. But honestly - if process is under your control, don't waste time on fixing the symptoms, fix the problem :) – zero323 Jan 12 '18 at 02:04

1 Answer


You can use SparkContext's `wholeTextFiles` API to read each JSON file into a `Tuple2(filename, whole text)`, transform the whole text into valid single-line JSON strings, and then finally use sqlContext to read them as JSON into a dataframe.

sqlContext\
    .read\
    .json(sc
          .wholeTextFiles("path to your multiline json file")
          .values()
          .flatMap(lambda x: x
                   .replace("\n", "#!#")       # flatten each file to one line
                   .replace("}{", "}#!#{")     # separate back-to-back objects
                   .replace("{#!# ", "{")
                   .replace("#!#}", "}")
                   .replace(",#!#", ",")
                   .split("#!#")))\
    .show()

You should get a dataframe as:

+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+

You can modify the code according to your needs, though.
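The replace chain can be sanity-checked outside Spark by running it in plain Python on the sample from the question and feeding every piece to `json.loads`. Note the `}{` → `}#!#{` step: with the question's data, where two objects meet as `}{` on a single line, the final split has nothing to cut on without it.

```python
import json

raw = '{\n  "key1" : "value1",\n  "key2" : "value2"\n}{\n  "key1" : "value1",\n  "key2" : "value2"\n}'

pieces = (raw
          .replace("\n", "#!#")      # flatten to a single line
          .replace("}{", "}#!#{")    # separate back-to-back objects
          .replace("{#!# ", "{")
          .replace("#!#}", "}")
          .replace(",#!#", ",")
          .split("#!#"))

rows = [json.loads(p) for p in pieces]
print(rows)  # two dicts, one per JSON object
```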

Ramesh Maharjan
  • Hi, my data is structured as follows, what might you recommend if I wanted restaurant id as values in one column, latitude and longitude in other columns? Thanks! ===> [{"restaurant_id": "1234", "infos": [{"timestamp": "2020-02-03T00:57:26.000Z", "longitude": "-123, "latitude": "456"}{"restaurant_id": "5678", "infos":[{"timestamp": "2.... – DrDEE Mar 20 '20 at 00:57
  • 1
    Really helpful. Worked like charm. Thanks – Sunil Pachlangia Oct 16 '20 at 10:22