
I am using Spark SQL to extract some information from a JSON file. I want to save the result of the SQL analysis to another JSON file so I can plot it with Plateau or with d3.js, but I don't know exactly how to do it. Any suggestions?

val inputTable = sqlContext.jsonFile(inputDirectory).cache()
inputTable.registerTempTable("tweetTable")

val languages = sqlContext.sql("""
        SELECT 
            user.lang, 
            COUNT(*) as cnt
        FROM tweetTable 
        GROUP BY user.lang
        ORDER BY cnt DESC 
        LIMIT 15""")
languages.rdd.saveAsTextFile(outputDirectory + "/lang")
languages.collect.foreach(println)

I wouldn't mind saving my data to a .csv file either, but I don't know exactly how to do that.

Thanks!

lds
  • Possible duplicate http://stackoverflow.com/questions/33174443/how-to-save-a-spark-dataframe-as-csv-on-disk/33174577#33174577 – eliasah Oct 18 '15 at 18:06

1 Answer


It is just

val languagesDF: DataFrame = sqlContext.sql("<YOUR_QUERY>")
languagesDF.write.json("your.json")

You do not need to go back to an RDD.

Still, take care: your JSON output will be split into multiple part files. If that is not your intention, you can circumvent it (if really required); the main point is to use repartition or coalesce.
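As a minimal sketch of that point (assuming the `languagesDF` DataFrame and `outputDirectory` from the code above; the CSV writer shown is the built-in one from Spark 2.x, while on Spark 1.x, which `sqlContext.jsonFile` suggests, you would need the spark-csv package instead):

```scala
// Coalesce down to a single partition so the output directory
// contains one part file instead of many.
languagesDF.coalesce(1)
  .write
  .json(outputDirectory + "/lang-json")

// The same idea works for CSV output (Spark 2.x built-in writer):
languagesDF.coalesce(1)
  .write
  .option("header", "true")
  .csv(outputDirectory + "/lang-csv")
```

Note that even with `coalesce(1)`, Spark still writes a directory containing one part-* file plus a _SUCCESS marker, so you may need to move or rename the part file afterwards if you want a single plain file.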

Martin Senne
By any chance, do you know if it is possible to avoid the hadoopish format and store data to a file under an s3 key name of my choice, instead of the directory with _SUCCESS and part-* files? – lisak May 19 '16 at 20:39