
I am studying SparkR. I have a CSV file:

a <- read.df(sqlContext, "./mine/a2014.csv", "csv")

I want to use write.df to store this file. However, when I use:

write.df(a, "mine/a.csv")

I get a folder called a.csv in which there is no CSV file at all.

Feng Chen
  • Are there any files in the folder, or is it completely empty? – sgvd May 23 '16 at 16:01
  • The folder a.csv contains 5 files: _common_metadata, _metadata, _SUCCESS and two more with very long names, but none of them can be opened by double-clicking. When I try to open them, I get a message like: Could not display “_common_metadata”. The file is of an unknown type. By the way, all of this happens on Linux in a VirtualBox VM. – Feng Chen May 24 '16 at 14:32

1 Answer


Spark partitions your data into blocks so that it can distribute those partitions over the nodes in your cluster. When writing the data, it retains this partitioning: it creates a directory and writes each partition to a separate file. This way it can take better advantage of distributed file systems (writing each block in parallel to HDFS/S3), and it doesn't have to collect all the data onto a single machine, which might not be able to handle that amount of data.
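
For example, listing the output directory with base R shows this layout (a sketch; the part-file names are illustrative, yours will be different):

list.files("mine/a.csv")
# e.g. "_SUCCESS" "_common_metadata" "_metadata"
#      "part-r-00000-<long id>" "part-r-00001-<long id>"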

The two files with the long names are the 2 partitions of your data and hold the actual CSV data. You can see this by copying them, renaming the copies with a .csv extension and double-clicking them, or with something like `head longfilename` on the command line.
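
You can also take that peek from within R; the part-file name here is hypothetical, so substitute one of the long names from your folder:

readLines("mine/a.csv/part-r-00000-xxxx", n = 5)  # first 5 lines of one partition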

You can test whether the write was successful by trying to read it back in: give Spark the path to the directory and it will recognize it as a partitioned file, through the metadata and _SUCCESS files you mentioned.
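
As a minimal check, reusing the sqlContext from your question (again untested, since I don't use SparkR):

a2 <- read.df(sqlContext, "mine/a.csv", "csv")
head(a2)  # should show the same rows as the original data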

If you do need all the data in one file, you can use repartition to reduce the number of partitions to 1 and then write it:

b <- repartition(a, 1)
write.df(b, "mine/b.csv")

This will result in just one long-named file, which is a CSV file with all the data.

(I don't use SparkR myself, so this is untested; in Scala/PySpark you would prefer coalesce over repartition, but I couldn't find an equivalent SparkR function.)

sgvd
  • Thanks a lot for your answer, I learned a lot. Just one thing: when I try to open the file with the long name, there is just a pile of nonsense characters in it. – Feng Chen May 24 '16 at 16:12
  • I see now that SparkR writes dataframes in Parquet format when using `write.df`. You have to specify explicitly that you want CSV. You can try `write.df(b, "mine/b.csv", "csv")`, analogous to how you read it, or you may have to give the full format specification as described in http://stackoverflow.com/a/34922656/1737727 (again, I don't actually use SparkR myself). – sgvd May 24 '16 at 16:29
  • Thanks a lot! I still cannot figure this out, but I know how to use write.text and read.text to do this, so it is OK. – Feng Chen May 25 '16 at 13:15