
I am frequently dealing with containers getting killed by YARN for exceeding memory limits. I suspect it has to do with caching and unpersisting RDDs/DataFrames inefficiently.

What is the best way to debug this type of issue?

I have looked at the "Storage" tab in the Spark web UI, but the RDD names don't get any more descriptive than "MapPartitionsRDD" or "UnionRDD". How do I figure out which specific RDDs take up the most space in the cache?

To get to the bottom of the out-of-memory errors, I need to identify which RDDs take up the most space in the cache, and I also want to be able to track when they get unpersisted.

B. Smith

1 Answer

  • For RDDs you can set meaningful names using the setName method:

    import org.apache.spark.rdd.RDD

    val rdd: RDD[T] = ???
    rdd.setName("foo")  // this name shows up in the Storage tab once the RDD is cached
    
  • For catalog-backed tables:

    import org.apache.spark.sql.DataFrame

    val df: DataFrame = ???
    df.createOrReplaceTempView("foo")  // register the DataFrame under a readable name
    spark.catalog.cacheTable("foo")    // cache it through the catalog
    

    the name in the catalog will be reflected in both the UI and SparkContext.getPersistentRDDs (see the sketch below).

  • I am not aware of any solution which works for standalone Datasets.
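
  • To check which names are attached to what is currently cached, you can also query getPersistentRDDs directly. A minimal sketch, assuming a SparkSession named spark (rdd.name is null for anything you never named):

    spark.sparkContext.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
    }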

  • Thank you! Is there a good way to figure out when the RDDs are unpersisted? Or do you basically have to keep refreshing the page? – B. Smith Nov 20 '17 at 22:44
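
One way to watch for unpersist events instead of refreshing the page is to register a SparkListener and override its onUnpersistRDD callback. A minimal sketch, assuming a plain SparkContext sc:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}

    sc.addSparkListener(new SparkListener {
      // invoked whenever an RDD is removed from the cache
      override def onUnpersistRDD(event: SparkListenerUnpersistRDD): Unit =
        println(s"RDD ${event.rddId} was unpersisted")
    })

The event only carries the numeric RDD id, so pairing it with the names from getPersistentRDDs (captured before the unpersist) makes the log readable.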