
I have a web server which returns JSON data that I would like to load into an Apache Spark DataFrame. Right now I have a shell script that uses wget to write the JSON data to file and then runs a Java program that looks something like this:

DataFrame df = sqlContext.read().json("example.json");

I have looked at the Apache Spark documentation and there doesn't seem to be a way to automatically join these two steps together. There must be a way of requesting the JSON data in Java, storing it as an object and then converting it to a DataFrame, but I haven't been able to figure it out. Can anyone help?
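
For the "requesting JSON data in Java" part on its own, plain java.net.HttpURLConnection seems sufficient; here is a minimal sketch, where the endpoint URL is a placeholder and fetchJson is just an illustrative helper name:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative helper: fetch the raw JSON response body from the web server.
static String fetchJson(String endpoint) throws Exception {
    URL url = new URL(endpoint); // placeholder, e.g. "http://localhost:8080/example.json"
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");

    StringBuilder body = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line).append('\n');
        }
    }
    conn.disconnect();
    return body.toString();
}

That gets me the data as a String, but I still don't see how to hand it to Spark without writing it to disk first.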

Daniel Ball

1 Answer


You could store the JSON data in a list of Strings, like:

final String JSON_STR0 = "{\"name\":\"0\",\"address\":{\"city\":\"0\",\"region\":\"0\"}}";
final String JSON_STR1 = "{\"name\":\"1\",\"address\":{\"city\":\"1\",\"region\":\"1\"}}";
List<String> jsons = Arrays.asList(JSON_STR0, JSON_STR1);

where each String represents a JSON object.

Then you could transform the list into an RDD (here sc is a JavaSparkContext):

JavaRDD<String> jsonRDD = sc.parallelize(jsons);

Once you have the RDD, it is straightforward to get a DataFrame:

DataFrame data = sqlContext.read().json(jsonRDD);
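
Putting the two steps together, a minimal end-to-end sketch could look like the following. It assumes the Spark 1.x Java API (where DataFrame still exists), a placeholder endpoint URL, and a server that returns one JSON object per line; the class name is just illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class JsonOverHttpToDataFrame {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "json-over-http");
        SQLContext sqlContext = new SQLContext(sc);

        // Fetch the JSON response from the web server (placeholder URL).
        URL url = new URL("http://localhost:8080/example.json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        // Assumes the server returns one JSON object per line.
        List<String> jsons = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                jsons.add(line);
            }
        }
        conn.disconnect();

        // Parallelize the strings and let Spark infer the schema.
        JavaRDD<String> jsonRDD = sc.parallelize(jsons);
        DataFrame df = sqlContext.read().json(jsonRDD);
        df.show();

        sc.stop();
    }
}

Note that this still materialises the whole response on the driver before parallelizing it, so it only suits payloads that fit in driver memory.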
Yuan JI
  • OK this works (apologies for my previous comment, which I have deleted). I used this answer in combination with http://stackoverflow.com/questions/2586975/how-to-use-curl-in-java. I guess what I find a bit confusing is how this works; I would expect the json method to accept only a file path. Also, this method seems a bit memory-heavy for very large JSON files, as you are constantly recopying the data (http -> Java String -> RDD -> DataFrame) instead of just loading it from a file. I'm wondering if Spark has some sort of JSON-over-REST way of talking to a data source instead. – Daniel Ball Jun 10 '16 at 10:33
  • You are right, loading the data into objects consumes a lot of memory. I'm looking for a JSON-over-REST solution too. I'll be back once I find one. – Yuan JI Jun 10 '16 at 12:07