I was trying to read Excel files residing on AWS S3. Since I already had PySpark pipelines set up, I attempted to use com.crealytics.spark.excel for the Excel reads. It worked fine for files under 10 MB, but with larger files (50 to 150 MB) the job started failing with:
"java.lang.OutOfMemoryError: Java heap space"
I referred to AWS Glue's docs and found the following troubleshooting guide: AWS Glue OOM Heap Space
That guide, however, only deals with the many-small-files problem and other driver-intensive operations, and the only suggestion it had for my situation was to scale up.
For the 50 MB files, scaling up to 20-30 workers let the job succeed, but the 150 MB file still could not be read.
I then approached the problem with a different toolset: boto3 with pandas (or awswrangler). That did the job with just 4 workers in under 10 minutes.
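The alternative route was essentially the following (a minimal sketch; bucket, key, and variable names are placeholders, and awswrangler's wr.s3.read_excel could be used instead of the boto3 + pandas steps):

```python
# Sketch of the boto3 + pandas route that worked (bucket/key are placeholders)
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="path/to/large_file.xlsx")

# pandas needs openpyxl installed to parse .xlsx content
pdf = pd.read_excel(io.BytesIO(obj["Body"].read()))

# hand the data back to Spark for the rest of the pipeline
sdf = spark.createDataFrame(pdf)
```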
I wanted to know whether I did something incorrectly with crealytics, given that PySpark is supposed to be much more powerful compute-wise thanks to its distributed nature. And if the result above is conclusive, could anyone explain why this happened, based on how the two approaches work under the hood?