
I seem to have a memory problem using PySpark's ML package. I am trying to use ALS.fit on a DataFrame with 40 million rows. Using JDK-11 produced the error:

"java.lang.NoSuchMethodError: sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner" 

It worked with 13 million rows, so I guess it's a memory-cleanup issue.

I tried it using JDK-8, as proposed here: Apache Spark method not found sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner;

but I still run into an error, because the heap memory does not suffice. I get this error message:

"... java.lang.OutOfMemoryError: Java heap space ..."

Does anyone have an idea how to work around this?

I am using Ubuntu 18.04 LTS, Python 3.6, and PySpark 2.4.2.

edit:

This is how I patched together my Spark context configuration (I have 16 GB of RAM):

    conf = spark.sparkContext._conf.setAll([
        ("spark.driver.extraJavaOptions", "-Xss800M"),
        ("spark.memory.offHeap.enabled", "true"),
        ("spark.memory.offHeap.size", "4g"),
        ("spark.executor.memory", "4g"),
        ("spark.app.name", "Spark Updated Conf"),
        ("spark.executor.cores", "2"),
        ("spark.cores.max", "2"),
        ("spark.driver.memory", "6g")])

I'm not sure if this makes sense!
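
For comparison, here is a sketch of how these settings could be applied up front in a standalone script instead. As far as I understand, spark.driver.memory only takes effect if it is configured before the driver JVM starts, so calling _conf.setAll on an already-running context may be too late (untested, same values as above):

    # Sketch only: configure the session before it is first created.
    # spark.driver.memory must be in place before the driver JVM starts;
    # setting it on an already-running context has no effect.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("Spark Updated Conf")
             .config("spark.driver.memory", "6g")
             .config("spark.executor.memory", "4g")
             .config("spark.memory.offHeap.enabled", "true")
             .config("spark.memory.offHeap.size", "4g")
             .config("spark.executor.cores", "2")
             .config("spark.cores.max", "2")
             .getOrCreate())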

These are the first lines of the error message:

[Stage 8:==================================================>   (186 + 12) / 200]19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_196 in memory! (computed 3.6 MB so far)
19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_192 in memory! (computed 5.8 MB so far)
19/07/02 14:43:29 WARN BlockManager: Persisting block rdd_37_192 to disk instead.
19/07/02 14:43:29 WARN BlockManager: Persisting block rdd_37_196 to disk instead.
19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_197 in memory! (computed 3.7 MB so far)
19/07/02 14:43:29 WARN BlockManager: Persisting block rdd_37_197 to disk instead.
19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_196 in memory! (computed 3.6 MB so far)
[Stage 8:======================================================>(197 + 3) / 200]19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_192 in memory! (computed 5.8 MB so far)
[Stage 9:>                                                        (0 + 10) / 10]19/07/02 14:43:37 WARN BlockManager: Block rdd_40_3 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_40_4 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_40_7 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_41_3 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_41_4 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_41_7 could not be removed as it was not found on disk or in memory
19/07/02 14:43:38 ERROR Executor: Exception in task 7.0 in stage 9.0 (TID 435)
java.lang.OutOfMemoryError: Java heap space
19/07/02 14:43:39 WARN BlockManager: Block rdd_40_5 could not be removed as it was not found on disk or in memory
19/07/02 14:43:38 ERROR Executor: Exception in task 4.0 in stage 9.0 (TID 432)
java.lang.OutOfMemoryError: Java heap space
        at scala.collection.mutable.ArrayBuilder$ofInt.mkArray(ArrayBuilder.scala:327)
[...]

1 Answer


Ultimately you probably want to expand the memory heap with the help of the -Xmx parameter.

You can determine how much memory it needs in various ways: simply increase the heap until the job works, or define a very large heap, see how much of it is actually used, and then set it to a sensible size.
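
In PySpark specifically, the heap sizes are normally expressed through Spark's own settings rather than a raw -Xmx flag; Spark rejects -Xmx placed inside spark.driver.extraJavaOptions. A minimal sketch with illustrative values:

    # Minimal sketch (illustrative values): spark.driver.memory and
    # spark.executor.memory are the practical equivalent of -Xmx for the
    # driver and executor JVMs. Spark rejects a raw -Xmx inside
    # spark.driver.extraJavaOptions, so size the heap with these instead.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.driver.memory", "8g")    # driver JVM heap
             .config("spark.executor.memory", "8g")  # executor JVM heap
             .getOrCreate())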

You can monitor heap usage in different ways, for example:

  • run your application with options that write a garbage-collection log: -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -verbose:gc -Xloggc:/some_path/gc.log (a PySpark sketch follows this list)
  • run your application with the command-line option -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail and use the jcmd utility: jcmd <pid> VM.native_memory summary
  • or some other way, even graphical utilities; just google it if you need to.
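
As a hedged illustration of the GC-log bullet in a PySpark setting (the flags are for JDK 8, which the question fell back to; on JDK 9+ they were replaced by unified -Xlog:gc logging, and the log path is made up):

    # Sketch, assuming JDK 8: attach GC-logging flags to the executor JVMs.
    # The log path is illustrative.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.extraJavaOptions",
                     "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps "
                     "-Xloggc:/tmp/executor_gc.log")
             .getOrCreate())

    # For the driver JVM itself (client mode) the same flags have to be passed
    # before it launches, e.g. via spark-submit --driver-java-options.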
  • Thanks, how exactly do I use the -Xmx parameter? I edited my question with my newly added conf call to the Spark context. – Luke Hndrch Jul 02 '19 at 12:22
  • Unfortunately I know nothing about PySpark. Maybe you should add this parameter to spark.driver.extraJavaOptions, or maybe there is some other PySpark option for that. If you just run a Java program, you simply add it to the command line. – NickL Jul 03 '19 at 13:50