I am experiencing memory issues when running PySpark jobs on our YARN cluster. YARN keeps killing my executors for exceeding their memory limits, no matter how much memory I give them, and I cannot figure out why.
Example screenshot:
The amount of data a single task processes is even slightly below the commonly recommended 128 MB, and yet the executor gets killed for exceeding its 10 GB container limit (6 GB executor memory + 4 GB overhead). What is going on there?
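For reference, the executors are sized roughly like this (a simplified sketch, not our actual job setup; on Spark versions before 2.3 the overhead property is spark.yarn.executor.memoryOverhead instead):

```python
from pyspark.sql import SparkSession

# Simplified sketch of the executor sizing, for illustration only.
spark = (
    SparkSession.builder
    .appName("problem-job")                          # placeholder name
    .config("spark.executor.memory", "6g")           # JVM heap per executor
    .config("spark.executor.memoryOverhead", "4g")   # everything outside the JVM heap
    .getOrCreate()
)
# YARN sizes each container as heap + overhead, i.e. 6 GB + 4 GB = 10 GB,
# which is the limit the executors are being killed for exceeding.
```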
The only answer I keep bumping into everywhere I look is to increase the memory allocation even further, but obviously there is a physical limit to that at some point (and we do want to run other MapReduce/Spark jobs on the cluster at the same time). So rather than increasing the allocation blindly, I would like to understand why so much memory is being used.
Any help with this would be greatly appreciated! If you need any additional details from me, I'll be glad to provide them if I can.
UPDATE: Found the culprit. By carefully restructuring and analyzing the code, I managed to isolate the part that is running out of memory: it uses a rather complicated UDF. I'll try to rework that bit of code; maybe that will solve the problem.
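For anyone hitting the same thing: my current understanding is that a plain Python UDF runs in separate Python worker processes on each executor, and their memory lives outside the JVM heap, so it counts against the overhead portion of the container rather than spark.executor.memory, which would explain the kills. Below is a rough sketch of the kind of rework I have in mind; the real UDF is much more involved, and the column names and logic here are made up purely for illustration:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the real data and the real UDF's logic.
df = spark.createDataFrame([(1.0, 2.0), (10.0, 0.5)], ["amount", "weight"])

# Option 1: express the logic with built-in column functions so it runs
# entirely inside the JVM -- no Python worker memory is involved at all.
with_builtin = df.withColumn("score", F.log1p(F.col("amount")) * F.col("weight"))

# Option 2: if Python code is unavoidable, a vectorized (pandas) UDF exchanges
# whole Arrow batches instead of pickling rows one by one, which usually keeps
# the Python workers' memory footprint smaller and more predictable.
@pandas_udf(DoubleType())
def scaled_log(amount: pd.Series, weight: pd.Series) -> pd.Series:
    return np.log1p(amount) * weight

with_pandas_udf = df.withColumn("score", scaled_log("amount", "weight"))
```

Option 1 would be ideal if the logic can be expressed with built-ins; option 2 at least avoids serializing rows one at a time between the JVM and Python.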