Currently trying to write a large number (7.5 million) of JSON files from S3 into a few Parquet files (~128 MB each, or ~3600 records per file) with EMR. However, the process either takes too long or does not output the correct number of files.
The following is run on EMR with 12 c4.8xlarge core nodes and an r3.2xlarge master node:
df = spark.read.json("s3://BUCKET")
After reading the files into a DataFrame, I attempted to write directly to S3:
df.write.parquet("s3://BUCKET")
However, this created 200,000 files with only about 36 records per file over the course of 3 hours.
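For reference, the number of output files should match the DataFrame's partition count, since each partition is written as one file. A quick sanity check (assuming the same df as above):

# Quick sanity check: each partition of df becomes one Parquet output file,
# so this count should line up with the ~200,000 files observed.
print(df.rdd.getNumPartitions())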
I have also attempted to repartition and coalesce before writing:
df.coalesce(2000).write.parquet("s3://BUCKET")
df.repartition(2000).write.parquet("s3://BUCKET")
However, each of these took around 10 hours.
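One alternative I am considering (a sketch only, assuming Spark 2.2+ where the spark.sql.files.maxRecordsPerFile setting exists) is capping records per file instead of hand-tuning the partition count:

# Sketch, assuming Spark 2.2+: cap the number of records written per file.
# The partition count still controls parallelism; the cap controls file size.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 3600)
df.repartition(2000).write.parquet("s3://BUCKET")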
The expected output:
- a set number of records or a set file size per Parquet file (roughly what I am after is sketched below)
- the whole process should run in less than 3 hours
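An illustrative sketch of what I mean (the extra count() pass may itself be expensive, and the names are just placeholders):

# Illustrative only: derive the number of output files from the target
# records-per-file, then repartition once before the write.
total_records = df.count()                       # extra pass over the data
records_per_file = 3600                          # target from above
num_files = max(1, int(total_records / records_per_file))
df.repartition(num_files).write.parquet("s3://BUCKET")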
Please suggest any EMR configuration or Spark code (Python or Scala solutions are welcome) I should be using.
Also, if anyone has any idea why EMR is creating so many tasks to begin with, that would be helpful as well.