1

Here is the code that I am using to write dataframe to JSON. I am running this code from zeppelin:

val df = Seq((2012, 8, "Batman", 9.8), (2012, 8, "Hero", 8.7), (2012, 7, "Robot", 5.5), (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")
df.write.json("/tmp/out.json")

What I expect is dataframe data written in /tmp/out.json file. However it is creating directory with name "/tmp/out.json" and inside that I find following two files:

_SUCCESS  
._SUCCESS.crc

None of these file is having JSON data. What am I missing here?

mrsrinivas
  • 34,112
  • 13
  • 125
  • 125
vatsal mevada
  • 5,148
  • 7
  • 39
  • 68
  • 1
    Are you running a cluster or just locally? If cluster have you checked the output directory on your executors, as opposed to on the driver machine? – ImDarrenG Dec 02 '16 at 09:05
  • @ImDarrenG I can see json data on executor. And it is partitioned on executors. Is there any way to get all the json data on a single json file? – vatsal mevada Dec 02 '16 at 09:29
  • Yes, it's possible, see: http://stackoverflow.com/a/40594798/7098262 – Mariusz Dec 02 '16 at 10:33

1 Answers1

0

You have some options:

  • Write to a shared location and merge the files (not using Spark to do the merge)
  • df.rdd.collect() the data to the driver and write to a file. You will use standard scala io libraries so there won't be any partitioning. This has the disadvantage of having to pull all the data from executors to the driver, which could be slow or infeasible depending on the amount of data and driver resources.
  • Better approach than collecting the whole dataset would be to collect each partition in turn, and stream the data to a single file on the driver

e.g.:

val rdd = df.rdd
for (p <- rdd.partitions) {
    val idx = p.index
    val partRdd = rdd.mapPartitionsWithIndex(a => if (a._1 == idx) a._2 else Iterator(), true)
    //The second argument is true to avoid rdd reshuffling
    val data = partRdd.collect //data contains all values from a single partition 
                               //in the form of array
    //Now you can do with the data whatever you want: iterate, save to a file, etc.
}

https://stackoverflow.com/a/21801828/4697497

Community
  • 1
  • 1
ImDarrenG
  • 2,315
  • 16
  • 24