
Background

I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.

The Problem

Here's the code I'm using to output a .csv file to a folder on my hard drive:

spark_write_csv(d1, "C:/d1.csv")

When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part". Here's a screenshot:

[Screenshot: the newly created d1 folder, containing ~10 .csv files whose names begin with "part"]

The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".

What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?

Edit

A user below suggested that this post may answer the question, and it nearly does, but it seems the asker there is looking for Scala code, while I'm looking for R code.

logjammin
  • Repartition the data ---> [Write single CSV file using spark-csv](https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv) – OneCricketeer Aug 10 '21 at 18:41
  • Apologies for not seeing that post -- my searches of relevant prior posts all included the [r] and [sparklyr] tags. Good looking out, sir (or ma'am). – logjammin Aug 10 '21 at 18:58

2 Answers


I had the exact same issue.

In simple terms, the data is partitioned for computational efficiency. With multiple partitions, several workers/executors can write the table in parallel, one partition each; with a single partition, only one worker/executor can write the file, which makes the task much slower. The same principle applies not only to writing tables but to parallel computation in general.
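
As a quick sanity check, sparklyr's sdf_num_partitions reports how many partitions a table has, and therefore how many part files a write would produce. A minimal sketch, where the local connection and the throwaway mtcars table are just placeholders:

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and register a small example table.
sc <- spark_connect(master = "local")
d1 <- copy_to(sc, mtcars, "d1")

# Each partition becomes one part-*.csv file when the table is written.
sdf_num_partitions(d1)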

For more details on partitioning, you can check this link.

Suppose I want to save the table as a single file at the path path/to/table.csv. I would do it as follows:

table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table.csv")

You can find the full details of sdf_repartition in the official documentation.
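
One caveat: even with a single partition, Spark still writes a directory at the target path containing one part-*.csv file, not a flat file. If you want a true single .csv like write.csv produces, something along these lines should work (the paths, including the temporary directory, are placeholders):

library(sparklyr)
library(dplyr)

# Write the single-partition table to a temporary directory, then copy
# the lone part file out to a flat .csv path and clean up.
table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table_tmp")

part_file <- list.files("path/to/table_tmp",
                        pattern = "^part-.*\\.csv$", full.names = TRUE)
file.copy(part_file, "path/to/table.csv", overwrite = TRUE)
unlink("path/to/table_tmp", recursive = TRUE)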

Adriana LE

Data is divided into multiple partitions, and when you save the dataframe to CSV you get one file per partition. To get a single file, you need to bring all the data into a single partition before calling spark_write_csv.

You can use a method called coalesce to achieve this (in sparklyr it is exposed as sdf_coalesce):

sdf_coalesce(df, 1)
Mohana B C
  • Thanks, Mohana. I'm going to try this out when I get back to the office. I'll report back if I encounter any trouble but this looks like it'll work. – logjammin Aug 10 '21 at 18:59
  • When I run `coalesce(d1, 1)`, I get this error: `Error: '..1' must be a vector, not a object.` Did I miss something? – logjammin Aug 10 '21 at 19:42
  • 1
    EDIT: `R` was thinking I wanted `dplyr::coalesce`, when you're referring to a `coalesce` function that's inside Spark. I see. `R`'s spark frontend, `sparklyr`, apparently uses `sdf_coalesce` on the `R` side to call `coalesce` inside Spark. – logjammin Aug 10 '21 at 19:50
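
Putting the thread together, a minimal sketch of the working pipeline in R, using the question's d1 table (the output path is illustrative). Note that it is sparklyr's sdf_coalesce, not dplyr::coalesce, that wraps Spark's coalesce:

library(sparklyr)
library(dplyr)

# Collapse the table to one partition, then write it out. Spark still
# creates a directory at the target path, but it will contain a single
# part-*.csv file.
d1 %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv("C:/d1_out")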