
I have a web server which returns JSON data that I would like to load into an Apache Spark DataFrame. Right now I have a shell script that uses wget to write the JSON data to file and then runs a Java program that looks something like this:

DataFrame df = sqlContext.read().json("example.json");

I have looked at the Apache Spark documentation and there doesn't seem to be a way to automatically join these two steps together. There must be a way of requesting the JSON data in Java, storing it as an object and then converting it to a DataFrame, but I haven't been able to figure it out. Can anyone help?
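
For the "requesting JSON data in Java" part on its own, plain java.net.HttpURLConnection seems sufficient; here is a minimal sketch, where the endpoint URL is a placeholder and fetchJson is just an illustrative helper name:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative helper: fetch the raw JSON response body from the web server.
static String fetchJson(String endpoint) throws Exception {
    URL url = new URL(endpoint); // placeholder, e.g. "http://localhost:8080/example.json"
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");

    StringBuilder body = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line).append('\n');
        }
    }
    conn.disconnect();
    return body.toString();
}

That gets me the data as a String, but I still don't see how to hand it to Spark without writing it to disk first.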

Daniel Ball

1 Answer


You could store the JSON data in a list of Strings, like:

final String JSON_STR0 = "{\"name\":\"0\",\"address\":{\"city\":\"0\",\"region\":\"0\"}}";
final String JSON_STR1 = "{\"name\":\"1\",\"address\":{\"city\":\"1\",\"region\":\"1\"}}";
List<String> jsons = Arrays.asList(JSON_STR0, JSON_STR1);

where each String represents a JSON object.

Then you could transform the list into an RDD (here sc is a JavaSparkContext):

JavaRDD<String> jsonRDD = sc.parallelize(jsons);

Once you have the RDD, it is straightforward to get a DataFrame:

DataFrame data = sqlContext.read().json(jsonRDD);
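
Putting the two steps together, a minimal end-to-end sketch could look like the following. It assumes the Spark 1.x Java API (where DataFrame still exists), a placeholder endpoint URL, and a server that returns one JSON object per line; the class name is just illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class JsonOverHttpToDataFrame {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "json-over-http");
        SQLContext sqlContext = new SQLContext(sc);

        // Fetch the JSON response from the web server (placeholder URL).
        URL url = new URL("http://localhost:8080/example.json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        // Assumes the server returns one JSON object per line.
        List<String> jsons = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                jsons.add(line);
            }
        }
        conn.disconnect();

        // Parallelize the strings and let Spark infer the schema.
        JavaRDD<String> jsonRDD = sc.parallelize(jsons);
        DataFrame df = sqlContext.read().json(jsonRDD);
        df.show();

        sc.stop();
    }
}

Note that this still materialises the whole response on the driver before parallelizing it, so it only suits payloads that fit in driver memory.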
Yuan JI
  • OK this works (apologies for my previous comment, which I have deleted). I used this answer in combination with http://stackoverflow.com/questions/2586975/how-to-use-curl-in-java. I guess what I find a bit confusing is how this works; I would expect the json method to accept only a file path. Also, this method seems a bit memory-heavy for very large JSON files, as you are constantly recopying the data (http -> Java String -> RDD -> DataFrame) instead of just loading it from a file. I'm wondering if Spark has some sort of JSON-over-REST way of talking to a data source instead. – Daniel Ball Jun 10 '16 at 10:33
  • You are right, loading the data into objects consumes a lot of memory. I'm looking for a JSON-over-REST solution too. I'll be back once I find one. – Yuan JI Jun 10 '16 at 12:07