
Below is my code:

import org.apache.spark.sql.functions.col

spark.range(1, 10000)
  .withColumn("hashId", col("id") % 5)
  .write
  .partitionBy("hashId")
  .bucketBy(10, "id")
  .saveAsTable("learning.test_table")

Spark Configuration:

./spark-shell --master yarn --num-executors 10 --executor-cores 3 --executor-memory 5g
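
For reference, 10 executors × 3 cores gives 30 cores in total, and on YARN spark.default.parallelism typically defaults to the total core count, so spark.range should produce about 30 partitions, i.e. 30 write tasks. A quick sanity check of that task count from the same spark-shell session (a sketch; each writing task can emit a separate file per bucket it holds data for):

import org.apache.spark.sql.functions.col

val df = spark.range(1, 10000).withColumn("hashId", col("id") % 5)
df.rdd.getNumPartitions  // number of tasks that will write the table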

There are 5 partition directories, and inside each one there are 61 files:

hdfs dfs -ls /apps/hive/warehouse/learning.db/test_table/hashId=0 | wc -l
61
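
Note that hdfs dfs -ls also prints a "Found N items" header line, which wc -l counts as well, so the raw line count can overstate the file count by one. Counting only the file entries (lines whose permission string starts with -) avoids this:

hdfs dfs -ls /apps/hive/warehouse/learning.db/test_table/hashId=0 | grep -c '^-'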

After creating the table, I checked the underlying storage and found that it had created 305 data files plus 1 _SUCCESS file.

Could someone please explain why it is creating 305 files?

  • 5 partitions * 61 files/partition = 305 files in total...? – mck Jan 07 '21 at 15:54
  • Or are you asking why there are 61 files per partition, rather than 10? – mck Jan 07 '21 at 15:55
  • Yes, why are there 61 files in each partition? I also want to understand how the Spark configuration affects the number of files generated. – Shikha C Jan 07 '21 at 15:58
  • 1
    Does this answer your question? [Why is Spark saveAsTable with bucketBy creating thousands of files?](https://stackoverflow.com/questions/48585744/why-is-spark-saveastable-with-bucketby-creating-thousands-of-files) – mck Jan 07 '21 at 16:03
  • Thanks for the link; it gave me an idea of how to prevent this many files (see the sketch after these comments), but can anybody explain the calculation behind the 305 files? – Shikha C Jan 08 '21 at 05:07
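
Following up on the linked answer above, a sketch of the usual mitigation: repartition on the bucketing column, with the bucket count, before writing, so that each bucket's rows are produced by a single task and each hashId directory ends up with at most one file per bucket. The target table name below is hypothetical, and the resulting file counts are an expectation rather than verified output:

import org.apache.spark.sql.functions.col

spark.range(1, 10000)
  .withColumn("hashId", col("id") % 5)
  .repartition(10, col("id"))  // align task layout with bucketBy(10, "id")
  .write
  .partitionBy("hashId")
  .bucketBy(10, "id")
  .saveAsTable("learning.test_table_compact")  // hypothetical table name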
