I was trying to read Excel files residing on AWS S3. Since I already had PySpark pipelines set up, I attempted to use com.crealytics.spark.excel for the Excel reads. It worked fine for files under 10 MB, but with larger files (50 to 150 MB) the job started failing with:
"java.lang.OutOfMemoryError: Java heap space"
I referred to AWS Glue's docs and found the following troubleshooting guide: AWS Glue OOM Heap Space
That guide, however, only deals with the many-small-files problem and other driver-intensive operations, and the only suggestion it had for my situation was to scale up.
For the 50 MB files, scaling up to 20-30 workers let the job succeed, but the 150 MB file still could not be read.
I then approached the problem with a different toolset: boto3 with pandas (or awswrangler). That did the job with just 4 workers in under 10 minutes.
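The alternative route was essentially the following (a minimal sketch; bucket, key, and variable names are placeholders, and awswrangler's wr.s3.read_excel could be used instead of the boto3 + pandas steps):

```python
# Sketch of the boto3 + pandas route that worked (bucket/key are placeholders)
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="path/to/large_file.xlsx")

# pandas needs openpyxl installed to parse .xlsx content
pdf = pd.read_excel(io.BytesIO(obj["Body"].read()))

# hand the data back to Spark for the rest of the pipeline
sdf = spark.createDataFrame(pdf)
```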
I wanted to know whether I did something incorrectly with crealytics, given that PySpark is supposed to be much more powerful compute-wise thanks to its distributed nature. And if the result above is conclusive, could anyone explain why this happened, based on how the two approaches work under the hood?