
I am experiencing memory issues when running PySpark jobs on our cluster with YARN. YARN keeps killing my executors for exceeding their memory limits, no matter how much memory I give them, and I cannot figure out why.

Example screenshot:

[Screenshot from the Spark UI]

The amount of data that a single task is processing is even slightly below the usually recommended 128 MB, and yet the executor gets killed for exceeding 10 GB (6 GB executor memory + 4 GB overhead). What is going on there?
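For context, a minimal sketch of how the memory settings are configured (the values mirror the 6 GB + 4 GB split above; the app name is a placeholder and everything else is illustrative):

```python
from pyspark.sql import SparkSession

# Minimal sketch of the memory configuration described above:
# 6 GB executor heap + 4 GB overhead = 10 GB YARN container limit.
# On Spark < 2.3 the overhead property is spark.yarn.executor.memoryOverhead.
spark = (
    SparkSession.builder
    .appName("my-job")  # placeholder name
    .config("spark.executor.memory", "6g")
    .config("spark.executor.memoryOverhead", "4g")
    .getOrCreate()
)
```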

The only answer I keep bumping into, everywhere I look, is to increase the memory allocation even more. Obviously there is a physical limit to that at some point (and we do want to run other MapReduce/Spark jobs simultaneously), so rather than increasing the memory allocation mindlessly, I would like to understand why so much memory is being used.

Any help on that will be greatly appreciated! If you need any additional input from me, I'll be glad to provide it, if I can.

UPDATE: Found the culprit. By carefully restructuring and analyzing the code, I managed to pinpoint the part that is running out of memory: a rather complicated UDF. I'll try to rework that bit of code; maybe that will solve the problem.
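To illustrate the kind of rework I have in mind (a simplified, hypothetical example with made-up columns, not the real UDF): plain Python UDFs run row by row in separate Python worker processes, whose memory counts against the container overhead rather than the executor heap, so moving the logic into built-in column expressions keeps the work inside the JVM.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

# Toy dataframe standing in for the real data (not the actual schema).
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 0.0)], ["num", "den"])

# Before: a plain Python UDF, evaluated row by row in Python worker processes
# whose memory is counted against the container overhead, not the executor heap.
ratio_udf = F.udf(lambda num, den: num / den if den else None, DoubleType())
df_before = df.withColumn("ratio", ratio_udf("num", "den"))

# After: the same logic as built-in column expressions, which stay inside the JVM
# and avoid the extra Python serialization buffers.
df_after = df.withColumn(
    "ratio", F.when(F.col("den") != 0, F.col("num") / F.col("den"))
)
```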

  • Please include the code you use ([How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/8371915)) – Alper t. Turker Jun 15 '18 at 11:05
  • Can't do that, because it's not open source. Basically, I am reading two .tsv files into Spark dataframes, joining them, calculating some values for every entry, then aggregating by key and writing the output (roughly the shape sketched below these comments). The screenshot is from one of the first two stages (of around 10), so I imagine not much computation had happened there yet. – Yuriy Davygora Jun 15 '18 at 11:10
  • This is happening because you have given too much memory to the Spark executor, which is exceeding the YARN container memory. Try to reduce the executor and executor overhead memory – Kaushal Jun 15 '18 at 12:26
  • Previously I had 3g executor memory with 1g overhead, same problem – Yuriy Davygora Jun 15 '18 at 12:29
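
A rough sketch of the pipeline described in the comment above (paths, column names, and the per-entry computation are placeholders, since the real code cannot be shared):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sketch").getOrCreate()

# Placeholder paths and columns -- the real job is not open source.
left = spark.read.csv("hdfs:///data/left.tsv", sep="\t", header=True, inferSchema=True)
right = spark.read.csv("hdfs:///data/right.tsv", sep="\t", header=True, inferSchema=True)

result = (
    left.join(right, on="key")                         # join the two inputs
        .withColumn("value", F.col("a") * F.col("b"))  # per-entry computation (a complicated UDF in the real job)
        .groupBy("key")                                # aggregate by key
        .agg(F.sum("value").alias("total"))
)
result.write.csv("hdfs:///data/output", sep="\t", header=True)
```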

0 Answers