
I queried a table and checked the number of partitions of the resulting DataFrame: it is 14. The data size is more than 10 GB.

But when I look at the table location, there are 400 part files, which Spark created while saving the DataFrame as a table.

Ideally, the number of partitions should equal the number of output files, right?

Can someone please help me understand this scenario?
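For reference, a minimal sketch (assuming a spark-shell session, a hypothetical table name `my_table`, and a hypothetical table location) of how the two counts above can be checked:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val df = spark.sql("SELECT * FROM my_table")

// Partitions of the DataFrame after the read (14 in this case)
println(df.rdd.getNumPartitions)

// Part files at the table location (400 in this case)
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val numPartFiles = fs
  .listStatus(new Path("/warehouse/my_table")) // hypothetical location
  .count(_.getPath.getName.startsWith("part-"))
println(numPartFiles)
```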

1 Answer


See https://stackoverflow.com/a/53751673/2204206 for a quite thorough explanation. And no, there are cases where it's not ideal for the number of partitions to equal the number of files (e.g. when the files are huge, it may be better to split each one into several partitions).
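To illustrate that read-side splitting (a sketch, assuming a file-based source such as parquet and a hypothetical path): Spark caps each input partition at `spark.sql.files.maxPartitionBytes` (128 MB by default), so large files are read as several partitions and the partition count can exceed the file count.

```scala
// Lower the per-partition cap so huge files are split further
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024) // 64 MB

val df = spark.read.parquet("/data/big_files") // hypothetical path
// A single 1 GB file is now read as roughly 16 partitions of ~64 MB each
println(df.rdd.getNumPartitions)
```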

Lior Chaga
  • Thanks, this helps me understand the number of files it creates, but I still don't get why there is a difference in partitions between reading and writing. When I queried it was 14, but the end result is 400. Any clue why that happened? – Abhishek Allamsetty Nov 27 '19 at 14:49
  • Oh, I thought you meant you're reading 400 files and getting 14 partitions in Spark. So you're saying you read with 14 partitions and get an output of 400? Do you have a join or an aggregation in your query? I suggest you check `spark.sql.shuffle.partitions`; the default is 200. If your query involves a shuffle, then unless you repartition or coalesce afterwards, you'll end up with that number of partitions (and files) – Lior Chaga Nov 28 '19 at 07:16
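For completeness, a minimal sketch of the mechanism described in the last comment (the table name and query are hypothetical): a query containing a shuffle is written out with `spark.sql.shuffle.partitions` partitions, regardless of how many partitions the input was read with.

```scala
// A GROUP BY forces a shuffle, so the result has
// spark.sql.shuffle.partitions partitions (200 by default),
// not the 14 the table was read with
val agg = spark.sql("SELECT key, count(*) AS cnt FROM my_table GROUP BY key")
println(agg.rdd.getNumPartitions) // 200 with default settings

// Two common ways to control the number of output files:

// 1. Lower the shuffle parallelism before the query runs
spark.conf.set("spark.sql.shuffle.partitions", 14)

// 2. Coalesce (or repartition) just before writing
agg.coalesce(14).write.saveAsTable("my_table_agg") // ~14 part files
```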