
The title says it all: I want a way to check the Java runtime variables of the executor JVMs that get created, but I am working with PySpark. How can I access `java.lang.Runtime.getRuntime().maxMemory()` on the executors from PySpark?
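
On the driver this works through PySpark's internal py4j gateway, as one of the comments below also points out (a sketch; `sc._gateway` is a private attribute and this assumes an existing SparkContext `sc`):

# driver side only: py4j forwards the call into the driver JVM
driver_max_heap = sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory()
print(driver_max_heap)  # max heap of the driver JVM, in bytes

What I need is the same reading from each executor JVM.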

Based on the comments, I have tried the following code, but both approaches are unsuccessful.

# create an RDD
l = sc.range(100)

Now I have to run `sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory()` on each executor, so I do the following:

l.map(lambda x:sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory()).collect()

which results in:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

As SPARK-5063 says, the SparkContext can only be used on the driver: the lambda captures `sc`, and Spark refuses to serialize it for the workers.
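
This isn't specific to `Runtime`; any closure that references `sc` fails the same way when Spark tries to serialize it (a minimal reproduction):

# same SPARK-5063 error: the closure captures sc, and SparkContext
# cannot be pickled for shipping to workers
l.map(lambda x: sc.defaultParallelism).collect()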

I also tried

func = sc._gateway.jvm.java.lang.Runtime.getRuntime()
l.map(lambda x:func.maxMemory()).collect()

which results in the following error:

TypeError: cannot pickle '_thread.RLock' object

Presumably the py4j proxy holds a live gateway connection (with its locks), so it cannot be pickled and shipped to the workers either.
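
For reference, a driver-side approximation is to read the configured executor heap, i.e. the `-Xmx` the executor JVMs are launched with, rather than a live `Runtime` reading (a sketch, assuming the default of 1g when the property is unset):

# static configuration, not a live Runtime.maxMemory() reading:
# the heap size each executor JVM is launched with
print(sc.getConf().get("spark.executor.memory", "1g"))  # e.g. '1g'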
  • Is it for the driver JVM or for the workers? – ernest_k Dec 20 '22 at 11:28
  • Workers. Sorry for not making that clear. – figs_and_nuts Dec 20 '22 at 11:29
  • You may want to take a look at this answer: https://stackoverflow.com/a/35725213/5761558. I was able to do something like `func = sc._gateway.jvm.java.lang.Runtime.getRuntime()` followed by `func.maxMemory()`, and that returned the max memory. You just need to orchestrate that so it runs on the workers (maybe using a UDF or other RDD distributed calls). – ernest_k Dec 20 '22 at 11:37
  • I'm finding it difficult to run that. Modified the question describing the bottlenecks. – figs_and_nuts Dec 20 '22 at 18:11
