1

I'm using Apache Parquet on Hadoop and after a while I have one concern. When I generate Parquet files with Spark on Hadoop, things can get pretty messy. By messy I mean that the Spark job is generating a big number of Parquet files, and when I try to query them the query takes a long time because Spark is merging all the files together.

Can you show me the right way to deal with this, or am I maybe misusing them? Have you already dealt with it, and how did you resolve it?

UPDATE 1: Is some "side job" that merges those files into one Parquet file good enough? What size of Parquet files is preferred, i.e. are there any upper and lower bounds?

iMajna
  • 489
  • 4
  • 28
  • If those files are empty or small (< 15 MB), then indeed you need to repartition the data, but in any case we can't give you an exact answer with the given information. Please read about how to ask a question on SO: https://stackoverflow.com/help/how-to-ask ! – eliasah Jul 12 '17 at 13:07
  • There is also a shell utility called `parquet-tools` which you can use, but repartitioning should do it for you. – philantrovert Jul 12 '17 at 14:04

2 Answers

2

Take a look at this GitHub repo and this answer. In short, keep the size of the files larger than the HDFS block size (128 MB or 256 MB).
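
To make that concrete, here is a minimal Spark (Scala) sketch of one way to aim for files around the HDFS block size. The DataFrame, the paths, and the input-size estimate are assumptions for illustration, not something stated in the answer:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-sizing").getOrCreate()

// Assumed input path with many small Parquet files.
val df = spark.read.parquet("/data/raw/events")

// Aim for roughly one HDFS block (or more) per output file.
val targetFileSizeBytes = 256L * 1024 * 1024
// The total size has to be estimated somehow; here it is simply assumed (~50 GB).
val inputSizeBytes = 50L * 1024 * 1024 * 1024
val numFiles = math.max(1, (inputSizeBytes / targetFileSizeBytes).toInt)

// Repartition so the write produces roughly `numFiles` output files.
df.repartition(numFiles)
  .write
  .mode("overwrite")
  .parquet("/data/curated/events") // assumed output path
```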

gorros
  • 1,411
  • 1
  • 18
  • 29
  • Is it possible to build that big Parquet file while streaming (every 2 seconds), where every 2 seconds I receive around 1,000 rows? I can't see myself pumping up a Parquet file and then storing it on Hadoop. Approximately, from my experience, a 12 MB Parquet file (snappy) has around 20M rows in it. If you have any more links to share which can help me, I would be grateful :) Never mind, I just read about this GitHub repo :) – iMajna Jul 12 '17 at 17:58
  • Just run another job which compacts the files by means of the mentioned tool, or just implement it yourself with `coalesce` (see the compaction sketch after this comment thread). – gorros Jul 12 '17 at 18:04
  • You made my day :) – iMajna Jul 12 '17 at 18:04
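
Picking up the "side job" idea from UPDATE 1 and the comments above, a separate compaction job could be sketched like this. It is only a sketch: the paths and the partition count are placeholders, and the right count depends on your data volume (aim for files of at least one HDFS block):

```scala
import org.apache.spark.sql.SparkSession

// Standalone compaction job: read the many small Parquet files written by the
// streaming job and rewrite them as a few larger ones.
object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-parquet").getOrCreate()

    // Assumed location of the small files produced every few seconds.
    val small = spark.read.parquet("/data/streaming/events")

    // coalesce avoids a full shuffle; 8 is a placeholder, pick a count that
    // yields files of at least ~128 MB each.
    small.coalesce(8)
      .write
      .mode("overwrite")
      .parquet("/data/streaming/events_compacted") // assumed output path

    spark.stop()
  }
}
```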
1

A good way to reduce the number of output files is to use `coalesce` or `repartition`.
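
For illustration only (the paths and partition counts below are assumptions), the difference between the two is that `coalesce` merges existing partitions without a full shuffle, while `repartition` shuffles the data and produces more evenly sized partitions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reduce-output-files").getOrCreate()
val df = spark.read.parquet("/data/raw/events") // assumed input path

// coalesce(n): no full shuffle, cheaper, but file sizes can be uneven.
df.coalesce(10).write.mode("overwrite").parquet("/data/out/coalesced")

// repartition(n): full shuffle, more expensive, but evenly sized files.
df.repartition(10).write.mode("overwrite").parquet("/data/out/repartitioned")
```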

Luis
  • 503
  • 5
  • 11