I'm using Apache Parquet on Hadoop and after a while I have one concern. When I'm generating Parquet files in Spark on Hadoop it can get pretty messy. By messy I mean that the Spark job generates a big number of Parquet files, and when I try to query them the queries take a long time because Spark has to merge all those files together while reading.
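To make it concrete, the write boils down to something like this (simplified sketch, the paths and app name here are made up):

```scala
import org.apache.spark.sql.SparkSession

object GenerateParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("generate-parquet")
      .getOrCreate()

    // Read the incoming data for one run (path is made up for illustration)
    val events = spark.read.json("hdfs:///data/raw/events/2017-06-01")

    // Spark writes one part-*.parquet file per partition of the DataFrame,
    // so with a few hundred shuffle partitions each run appends a few
    // hundred small files to the output directory.
    events.write
      .mode("append")
      .parquet("hdfs:///data/parquet/events")

    spark.stop()
  }
}
```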
Can you show me the right way to deal with this, or am I maybe misusing Parquet? Have you already dealt with this, and how did you resolve it?
UPDATE 1: Is some "side job" that merges those files into one Parquet file good enough? What size of Parquet file is preferred, i.e. are there recommended lower and upper bounds?
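To clarify what I mean by a "side job": something along these lines, where the small files are read back and rewritten with fewer partitions (just a sketch, the paths and the partition count are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-parquet")
      .getOrCreate()

    // Read back everything the regular job has produced so far
    val events = spark.read.parquet("hdfs:///data/parquet/events")

    // coalesce() reduces the number of partitions, and Spark writes one
    // file per partition, so this controls how many (and therefore how
    // big) the output files are.
    events.coalesce(8)
      .write
      .mode("overwrite")
      .parquet("hdfs:///data/parquet/events_compacted")

    spark.stop()
  }
}
```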