Description
At my workplace we have a large amount of data that needs processing. It concerns a rapidly growing number of instances (currently ~3000), each with a few megabytes of data stored in gzipped CSV files on S3.
I have set up a Spark cluster and written a Spark script that does the following.
For every instance:
- load the data frame
- run calculations
- do not save the dataframe yet (so no action is triggered, which I confirmed in the Spark job UI)
Afterwards, I combine all the data frames into a single dataframe and save the result (which triggers the action).
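Simplified, the script looks roughly like this (bucket, paths and the actual calculation are placeholders):

```python
from functools import reduce
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

instance_ids = ["instance-0001", "instance-0002"]  # ~3000 of these in reality

def process_instance(instance_id):
    # reading the gzipped csv is only a transformation, no action yet
    df = spark.read.csv(
        f"s3a://my-bucket/instances/{instance_id}.csv.gz",
        header=True,
    )
    # placeholder for the real per-instance calculations (still lazy)
    return df.withColumn("instance_id", lit(instance_id))

frames = [process_instance(i) for i in instance_ids]

# combine everything and save, which triggers the single action
result = reduce(DataFrame.unionByName, frames)
result.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```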
Problem
The above works perfectly fine when I use a small number of instances, but with the full set I ran into the following problems:
- loading an instance file into a data frame takes 4-6 seconds, even though no action is triggered
- the loading of the dataframes happens on the driver
- because of the two points above, it takes nearly 2 hours to load all the data frames (I optimized this a bit by using Python "threads")
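For completeness, the "threads" workaround is just a thread pool around the loading step (reusing process_instance and instance_ids from the sketch above; the pool size is arbitrary):

```python
from multiprocessing.pool import ThreadPool

# overlap the 4-6 second per-file loading cost by creating several
# dataframes from different threads on the driver
with ThreadPool(16) as pool:
    frames = pool.map(process_instance, instance_ids)
```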
Could someone explain to me what is causing the slow loading and advise me how to deal with it?
Possibly relevant: I am using the AWS s3a Hadoop file system. Also, the first part of my calculation is completely independent per instance, which is one of the reasons I am hesitant to combine all the input data into a single gzipped CSV file.
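In case the setup matters, s3a is configured roughly like this (a minimal sketch with placeholder credentials, not my exact configuration):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # use the s3a connector for s3a:// paths
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)
```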
Any help would be greatly appreciated; I am writing this after racking my brain over this problem until 5 in the morning.
Please let me know if I should provide more details.
Edit
Thanks for the comments. I am running Spark on Kubernetes, so I can't merge the files using Hadoop commands. However, I am pursuing the idea of merging the instance files.
Edit 2
It turns out I was using Spark in completely the wrong way. I thought I would make it easier for Spark by keeping the data separate, but that backfired. The best solution seems to be to aggregate your input files into larger ones and to adapt your script so the instances are still processed separately, roughly as in the sketch below.
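For anyone hitting the same issue, a rough sketch of what that looks like (I'm assuming an instance_id column is added when the files are merged; the aggregation is just a placeholder for the real per-instance calculation):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# a single read over the merged files instead of ~3000 separate reads
df = spark.read.csv("s3a://my-bucket/merged/*.csv.gz", header=True)

# the per-instance part of the calculation becomes a grouped operation
# instead of a Python loop over separate dataframes
per_instance = df.groupBy("instance_id").agg(F.count("*").alias("row_count"))

per_instance.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```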