
I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries). The resulting RDD would then, roughly speaking, contain all of the lists of dictionaries combined into a single list of dictionaries. I haven't been able to find this in the documentation (https://spark.apache.org/docs/1.2.0/api/python/pyspark.html), but if I missed it please let me know.

So far I tried reading the JSON files and creating the combined list in Python, then using sc.parallelize(), however the entire dataset is too large to fit in memory so this is not a practical solution. It seems like Spark would have a smart way of handling this use case, but I'm not aware of it.
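
Roughly, what I tried looks like the following sketch (the directory name is just a placeholder):

import glob
import gzip
import json

combined = []
for path in glob.glob('json_files/*.json.gz'):   # placeholder directory
    with gzip.open(path) as f:
        combined.extend(json.load(f))            # each file holds one list of dictionaries

my_rdd = sc.parallelize(combined)                # breaks down once the combined list no longer fits in memory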

How can I create a single RDD in Python comprising the lists in all of the JSON files?

I should also mention that I do not want to use Spark SQL. I'd like to use functions like map, filter, etc., if that's possible.

Brandt

4 Answers


Following what tgpfeiffer mentioned in their answer and comment, here's what I did.

First, as they mentioned, the JSON files had to be formatted so they had one dictionary per line rather than a single list of dictionaries. Then, it was as simple as:

import json

my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)   # one RDD element per line of every file
my_RDD_dictionaries = my_RDD_strings.map(json.loads)        # parse each line into a dictionary
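
For example, the resulting RDD then works with map and filter like any other (the 'value' and 'name' keys below are purely hypothetical, for illustration):

big_values = my_RDD_dictionaries.filter(lambda d: d.get('value', 0) > 100)
names = big_values.map(lambda d: d['name'])
names.take(5)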

If there's a better or more efficient way to do this, please let me know, but this seems to work.

Brandt

You can use sqlContext.jsonFile() to get a SchemaRDD (which is an RDD[Row] plus a schema) that can then be used with Spark SQL. Or see Loading JSON dataset into Spark, then use filter, map, etc. for a non-SQL processing pipeline. I think you may have to unzip the files first, and Spark can only work with files where each line is a single JSON document (i.e., multi-line objects are not possible).
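
A minimal sketch of the jsonFile() route, assuming an existing SparkContext named sc and newline-delimited JSON (the path is a placeholder):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
schema_rdd = sqlContext.jsonFile('path/to/json_lines')   # placeholder path; one JSON object per line
schema_rdd.printSchema()                                 # schema is inferred from the documents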

tgpfeiffer
  • Thanks for answering. I should have mentioned that I don't want to use Spark SQL, I want to use a non-SQL processing pipeline like in the question you referenced. I will update my original question. The answer to the question you referenced appears to be in Scala, not Python. Thanks again for your help though! – Brandt Jan 29 '15 at 01:20
    Right, it's in Scala, but the idea can be applied to your problem: Load the input data set using `sparkContext.textFile()` (which actually [seems to support gzipped files](http://stackoverflow.com/questions/16302385/gzip-support-in-spark)), then parse the string lines with a parser of your choice (such as [the json module](https://docs.python.org/2/library/json.html)), then process as you wish. – tgpfeiffer Jan 29 '15 at 01:58
  • Thanks, this worked! The key step was using the map function on json.loads. I'll post exactly what I did as an answer. Thanks a lot for your help. – Brandt Jan 29 '15 at 18:20

You can load a whole directory of files into a single RDD using textFile, which also supports wildcards. That won't give you the file names, but you don't seem to need them.
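
For example (the path and pattern are placeholders; gzip-compressed files are decompressed automatically on read):

lines = sc.textFile('/data/json_files/*.json.gz')   # one RDD of lines drawn from every matching file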

You can also combine Spark SQL with basic transformations like map and filter, because a SchemaRDD is still an RDD (in Python as well as in Scala).
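
As a sketch of that (assuming a SQLContext named sqlContext; the file name and the 'age' field are hypothetical), a SchemaRDD returned by jsonFile() accepts the usual RDD transformations directly:

people = sqlContext.jsonFile('people.json')          # hypothetical newline-delimited JSON file
adults = people.filter(lambda row: row.age >= 18)    # plain RDD filter over Row objects
ages = adults.map(lambda row: row.age)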

pzecevic

To load a list of JSON objects from a file as an RDD:

import json

def flat_map_json(x): return json.loads(x[1])   # x is a (filename, contents) pair; the file holds a JSON list
rdd = sc.wholeTextFiles('example.json').flatMap(flat_map_json)

Since wholeTextFiles reads each file as a single (filename, contents) record, the whole JSON list in a file is parsed at once and then flattened into individual dictionaries by flatMap.
Supritha P