
When reading from a Hive table, performing a projection, and writing the result back to HDFS, the output obviously contains less data than the raw table.

How can I ensure that the number of files per partition (date) does not become very large, i.e. that each partition does not end up containing a large number of small files?

df.coalesce(200).write.partitionBy('date').parquet('foo')

still outputs many small files. Obviously, I would like to not reduce the parallelism in Spark, but rather just merge the files later on.
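For reference, a minimal sketch of the repartition-before-write approach (the table name, column names, and output path below are assumptions): repartitioning on the same column used in `partitionBy` hashes all rows of a given date to a single task, so each date directory ends up with one output file instead of one file per task.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-write").getOrCreate()

    # Hypothetical source: the question projects a few columns from a Hive table.
    df = spark.table("db.raw_table").select("date", "col_a", "col_b")

    # Repartition on the same column used in partitionBy: every row of a given
    # date is hashed to one task, so each date=... directory gets a single file.
    (df.repartition("date")
       .write
       .partitionBy("date")
       .mode("overwrite")
       .parquet("/data/foo"))

If a single file per date is too large, one common variant is to add a random salt column and repartition on (date, salt) while still only partitioning the output by date, which spreads each date over a handful of files.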

Georg Heiler
  • You could use the `parquet-tools` jar with the `hadoop jar` command to merge files. Did you try that? – philantrovert May 14 '18 at 13:08
  • What are your connection settings, how much parallelism is there? – MatBailie May 14 '18 at 13:08
  • spark.executor.instances=250, spark.default.parallelism=1000, spark.sql.shuffle.partitions=500; however, an initial printSchema results in 10000 tasks being started. What I also notice is that the Shuffle Spill (Memory) is 7x as large as the actual data being processed. – Georg Heiler May 14 '18 at 13:41
  • http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/combine-small-parquet-files/td-p/33525/page/2 advises against using parquet-tools – Georg Heiler May 14 '18 at 13:44
  • Can't comment on the spill, you've given no code, but if you're at 10000 tasks, you're gonna get 10000+ files most of the time – MatBailie May 14 '18 at 14:28
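Since the Cloudera thread linked in the comments advises against parquet-tools, the "merge files later on" idea from the question can instead be done as a separate Spark compaction job that reads a finished date partition back and rewrites it with a single task. A minimal sketch, with hypothetical paths and date value:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-partition").getOrCreate()

    # Hypothetical paths; the base directory and date value are placeholders.
    src = "/data/foo/date=2018-05-14"
    dst = "/data/foo_compacted/date=2018-05-14"

    # Read one partition's many small files and rewrite them as a single file.
    (spark.read.parquet(src)
          .coalesce(1)
          .write
          .mode("overwrite")
          .parquet(dst))

Once the rewrite has succeeded, the compacted directory can be swapped in place of the original one, e.g. via an HDFS rename.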

0 Answers