Below is my code:
import org.apache.spark.sql.functions.col

spark.range(1, 10000)
  .withColumn("hashId", col("id") % 5)
  .write
  .partitionBy("hashId")
  .bucketBy(10, "id")
  .saveAsTable("learning.test_table")
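(For context on where the file count comes from: in a bucketed write, each task writes its own file for every bucket it holds data for, in every partition directory it touches, so the total scales with the number of writing tasks, not just the bucket count. Below is a minimal sketch of the usual way to cap that, assuming the goal were one file per bucket per hashId directory; the repartition(10, col("id")) call is my addition, not part of the original job, and relies on the repartition hash lining up with the bucket hash on the same column.)

import org.apache.spark.sql.functions.col

spark.range(1, 10000)
  .withColumn("hashId", col("id") % 5)
  .repartition(10, col("id"))  // pre-shuffle on the bucket column: each task holds one bucket
  .write
  .partitionBy("hashId")
  .bucketBy(10, "id")
  .saveAsTable("learning.test_table")

With the pre-shuffle, each task carries exactly one bucket's rows, so each hashId directory ends up with at most 10 files (one per bucket).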
Spark Configuration:
./spark-shell --master yarn --num-executors 10 --executor-cores 3 --executor-memory 5g
There are 5 partitions, and inside each partition there are 61 files:
hdfs dfs -ls /apps/hive/warehouse/learning.db/test_table/hashId=0 | wc -l
61
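(Note that `hdfs dfs -ls` prints a "Found N items" header line, which `wc -l` counts as well. A minimal sketch for counting only the data files from the same spark-shell session, assuming the default filesystem comes from the session's Hadoop configuration and the table path is the one shown above:)

import org.apache.hadoop.fs.{FileSystem, Path}

// Walk each hashId=... directory under the table and count only the
// part-* data files, so headers and directory entries never skew the count.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val table = new Path("/apps/hive/warehouse/learning.db/test_table")
fs.listStatus(table).filter(_.isDirectory).foreach { dir =>
  val n = fs.listStatus(dir.getPath).count(_.getPath.getName.startsWith("part-"))
  println(s"${dir.getPath.getName}: $n data files")
}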
After creating this table, I checked the backend and found that it had created 305 files plus 1 _SUCCESS file.
Could someone please explain why it is creating 305 files?