
Below is my code:

import org.apache.spark.sql.functions.col

spark.range(1, 10000)
  .withColumn("hashId", col("id") % 5)
  .write
  .partitionBy("hashId")
  .bucketBy(10, "id")
  .saveAsTable("learning.test_table")

Spark Configuration:

./spark-shell --master yarn --num-executors 10 --executor-cores 3 --executor-memory 5g
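
For reference, 10 executors × 3 cores gives 30 cores in total, and on YARN spark.default.parallelism typically defaults to the total core count, so spark.range should produce about 30 partitions, i.e. 30 write tasks. A quick sanity check of that task count from the same spark-shell session (a sketch; each writing task can emit a separate file per bucket it holds data for):

import org.apache.spark.sql.functions.col

val df = spark.range(1, 10000).withColumn("hashId", col("id") % 5)
df.rdd.getNumPartitions  // number of tasks that will write the table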

There are 5 partition directories, and inside each one there are 61 files:

hdfs dfs -ls /apps/hive/warehouse/learning.db/test_table/hashId=0 | wc -l
61
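
Note that hdfs dfs -ls also prints a "Found N items" header line, which wc -l counts as well, so the raw line count can overstate the file count by one. Counting only the file entries (lines whose permission string starts with -) avoids this:

hdfs dfs -ls /apps/hive/warehouse/learning.db/test_table/hashId=0 | grep -c '^-'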

After creating the table, I checked the underlying storage and found that it had created 305 data files plus 1 _SUCCESS file.

Could someone please explain why it is creating 305 files?

  • 5 partitions * 61 files/partition = 305 files in total...? – mck Jan 07 '21 at 15:54
  • Or are you asking why there are 61 files per partition, rather than 10? – mck Jan 07 '21 at 15:55
  • Yes, why are there 61 files in each partition? I also want to understand how the Spark configuration affects the number of files generated. – Shikha C Jan 07 '21 at 15:58
  • 1
    Does this answer your question? [Why is Spark saveAsTable with bucketBy creating thousands of files?](https://stackoverflow.com/questions/48585744/why-is-spark-saveastable-with-bucketby-creating-thousands-of-files) – mck Jan 07 '21 at 16:03
  • Thanks for the link; it gave me an idea of how to prevent this many files (see the sketch after these comments), but can anybody explain the calculation behind the 305 files? – Shikha C Jan 08 '21 at 05:07
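
Following up on the linked answer above, a sketch of the usual mitigation: repartition on the bucketing column, with the bucket count, before writing, so that each bucket's rows are produced by a single task and each hashId directory ends up with at most one file per bucket. The target table name below is hypothetical, and the resulting file counts are an expectation rather than verified output:

import org.apache.spark.sql.functions.col

spark.range(1, 10000)
  .withColumn("hashId", col("id") % 5)
  .repartition(10, col("id"))  // align task layout with bucketBy(10, "id")
  .write
  .partitionBy("hashId")
  .bucketBy(10, "id")
  .saveAsTable("learning.test_table_compact")  // hypothetical table name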
