
Can we write data to, say, 100 files, with 10 partitions in each file?

I know we can use repartition or coalesce to reduce the number of partitions. But I have seen some Hadoop-generated Avro data with many more partitions than files.

Kenny

1 Answer


The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions, you cannot write fewer than 10 files without reducing the partitioning (e.g. with coalesce or repartition).
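
For example, a minimal sketch of that one-to-one relationship (the row count, paths, and choice of Parquet here are arbitrary placeholders, not taken from the question):

// Writing a DataFrame that has 10 partitions produces 10 part files.
val tenPartDf = spark.range(1000).repartition(10)
tenPartDf.write.mode("overwrite").parquet("/tmp/ten_files")

// Coalescing to 2 partitions before the write produces 2 part files instead.
tenPartDf.coalesce(2).write.mode("overwrite").parquet("/tmp/two_files")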

Now, having said that, when the data is read back in it could be split into smaller chunks based on your configured split size, depending on the format and/or compression.
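
As a sketch of that read side (assuming a splittable format such as the Parquet output above; the path and the 32 MB value are placeholders), Spark SQL's file-based sources cap the bytes per read partition via spark.sql.files.maxPartitionBytes:

// Ask the file sources to cut input into roughly 32 MB read partitions (value in bytes).
spark.conf.set("spark.sql.files.maxPartitionBytes", "33554432")

// Data that was written as a handful of files can now come back as more
// read partitions than files, format and compression permitting.
val readBack = spark.read.parquet("/tmp/ten_files")
println(readBack.rdd.getNumPartitions)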

If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option, of course, would be to repartition.

The following will result in 2 files being written out even though the DataFrame only has 1 partition:

val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")
Silvio
  • Thanks for the reply. It seems to be one partition written to multiple files. What I am looking for is the other way around. The use case is that sometimes too many files in one folder causes some overhead for Hadoop to search these files, and sometimes causes memory issues. – Kenny Jan 08 '18 at 01:55
  • Yes, sorry, I misunderstood, so I updated my response to clarify. – Silvio Jan 08 '18 at 01:56
  • The Hadoop-generated data I am working on has 1000 files, but 80000 partitions. Not sure how difficult it would be for Spark to have the same ability as Hadoop. – Kenny Jan 08 '18 at 01:58
  • What file format are you using and how are you writing the data out? Are you using Hive, MapReduce, etc.? Just to make sure we're speaking of the same thing, when you say partitions are you speaking of input splits or Hive-style partition folders? Also, the 80000 partitions you see, are those when you read the data back in? That is governed by the HDFS block size. So as long as you write in the same format and same compression, that would still apply to Spark. – Silvio Jan 08 '18 at 02:05
  • The input files (1000 files) are generated from MapReduce in Avro format. I use Databricks' spark-avro to read them. After reading them in and checking numOfPartitions, I got about 80000 partitions. The partitions I mentioned are Spark partitions. So do you mean the original number of partitions is 1000, but because my HDFS block size is small, it breaks into 80000? – Kenny Jan 08 '18 at 02:17
  • Yes, Avro is splittable. So when Spark reads it in, it will split a single file into multiple partitions based on the block size. You can see in the code itself where the `isSplittable` method is `true`: https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/DefaultSource.scala#L103 – Silvio Jan 08 '18 at 02:23
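
To illustrate the effect discussed in these comments, a minimal sketch (the /data/avro path is a placeholder, and it assumes the databricks spark-avro package is on the classpath):

// Read MapReduce-generated Avro files with the databricks spark-avro source.
val avroDf = spark.read.format("com.databricks.spark.avro").load("/data/avro")

// Because Avro is splittable, each file may be cut into several input splits
// based on the HDFS block size, so this can report far more Spark partitions
// than there are files (e.g. ~80000 partitions from ~1000 files).
println(avroDf.rdd.getNumPartitions)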