
I tried to debug a very simple Spark Scala word count program. Since Spark is "lazy", I think I need to put the breakpoint at an "action" statement and then run that line of code; then I should be able to check the RDD variables defined before that statement and look at their data. So I put a breakpoint at line 14, and when debugging reaches it, I hit step-over to run line 14. However, after doing that, I cannot see or find any data for the variables text1 and text2 in the debug session's variable view. (I can see data inside the "all" variable in the debug view, though.) Am I doing this right? Why can't I see data in the text1/text2 variables?

Suppose my wordCount.txt is like this:

This is a text file with words aa aa bb cc cc

I expect to see (aa,2), (bb,1), (cc,2), etc. somewhere in the text2 variable view, but I don't find anything like that there. See the screenshot below the code.

I am using Eclipse Neon and Spark 2.1, and it is a local Eclipse debug session. Your help would be really appreciated, as I couldn't find any info after extensive searching. Here's my code:

package Big_Data.Spark_App 

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]){
    val conf = new SparkConf().setAppName("WordCountApp").setMaster("local")
    val sc = new SparkContext(conf)
    val text = sc.textFile("/home/cloudera/Downloads/wordCount.txt")
    val text1 = text.flatMap(rec => rec.split(" ")).map(rec => (rec, 1))
    val text2 = text1.reduceByKey((v1, v2) => v1 + v2).cache

    val all = text2.collect()  //line 14
    all.foreach(println)
  }
}

Here's the debug variable view, which shows that there is no actual data in the text2 variable:

Jerry
2 Answers


Spark evaluates lazily. Here is what I do: if I want to print to the console, I use:

rdd.take(20).foreach(x => println(x))

Or better, use rdd.sample, rdd.takeSample, sampleByKey (on pair RDDs), etc. These give a broader picture with large data sets.

Then there is rdd.toDebugString, which you can print out!
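
For example, a rough sketch combining these, assuming the text2 pair RDD from the question is in scope:

// Sample a fraction of the RDD without replacement and print a few results.
val sampled = text2.sample(withReplacement = false, fraction = 0.5, seed = 42L)
sampled.take(20).foreach(println)

// Print the RDD's lineage, i.e. the plan Spark will execute:
println(text2.toDebugString)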

Finally, you can put a breakpoint and observe the RDD in the Eclipse/IntelliJ debugger, but only after evaluation; otherwise you will just see the execution plan, not the values.

Apurva Singh

Spark does not evaluate each variable as you expect; it builds a DAG that gets executed once an action is called (e.g. collect). This post explains it in more detail: How DAG works under the covers in RDD? Essentially, those intermediate variables only store a reference to the chain of operations you created. If you'd like to inspect intermediate results, you need to call collect on each variable, as in the sketch below.
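
For instance, a minimal sketch of that approach, reusing text1 and text2 from the question (only sensible on small or test data, since collect brings everything to the driver):

val text1Data = text1.collect()  // Array of (word, 1) pairs, materialized
val text2Data = text2.collect()  // Array of (word, count) pairs, materialized
// A breakpoint here will show actual data in text1Data and text2Data.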

EDIT:

I forgot to mention above that you also have the option to inspect variables inside a Spark operation. Say you break down a mapper like this:

val conf = new SparkConf().setAppName("WordCountApp").setMaster("local")
val sc = new SparkContext(conf)
val text = sc.textFile("wordcount.txt")
val text1 = text.flatMap { rec =>
  val splitStr = rec.split(" ")  // can inspect this variable
  splitStr.map(r => (r, 1))      // can inspect variable r
}
val text2 = text1.reduceByKey((v1, v2) => v1 + v2).cache
val all = text2.collect()
all.foreach(println)

You can put a breakpoint in the mapper, for example to inspect splitStr for each line of text, or on the next line to inspect r for each word.

jamborta
  • Thanks Jamborta for the quick response ! I'll check out that link. So this would make it hard to debug existing code though... – Jerry May 16 '17 at 18:22
  • @Jerry maybe the additional info helps? – jamborta May 16 '17 at 18:42
  • I just used your new code and yes it worked ! Again thank you very much for your expertise !! – Jerry May 16 '17 at 19:40
  • Thanks for your inputs. That was helpful. Can you also take a look at below question and provide your comments? https://stackoverflow.com/questions/51975423/spark-streaming-with-scala-debugging-approach – rajcool111 Aug 23 '18 at 15:23
  • @jamborta Do you know if we can do the same variable inspect in pyspark? – user1269298 Sep 22 '18 at 17:55
  • @user1269298, yes you can do the same with pyspark using an IDE like pycharm. – jamborta Sep 24 '18 at 15:17
  • @jamborta thank you. I thought it is not doable in pyspark. I started a thread asking around and no one answers so far. https://stackoverflow.com/questions/52452981/pyspark-how-to-inspect-variables-within-rdd-operations – user1269298 Sep 24 '18 at 20:45