
I am currently preparing for a job interview as a Data Engineer, and I am stuck on a point of confusion. Below are the details.

If Spark RDDs are immutable by nature, then why are we able to create Spark RDDs with var?

  • RDDs are immutable, meaning you cannot alter the state of an RDD (i.e. you cannot add, delete, or update records inside it). A var just holds a reference; pointing the reference at a different object does not mean you have changed the state of the RDD, it means you have changed the state of the reference variable. – kavetiraviteja Aug 30 '20 at 08:52
  • If that is the case, then suppose I did var a = sc.parallelize(Seq("1","2","3")) and then a = a.map(_ + " is a number"). In such a case, if I do a.foreach(println), the latest changed RDD will come out, but then how will we fetch the earlier RDD, the one from before the map function was applied? @kavetiraviteja – Noob_developer Aug 30 '20 at 09:21

2 Answers


Your confusion has little to do with Spark's RDDs. It will help to understand the difference between a variable and an object. A more familiar example:

Suppose you have a String, which we all know is an immutable type:

var text = "abc"                 //1
var text1 = text                 //2
text = text.substring(0,2)       //3

Like Spark's RDDs, Strings are immutable. But what do you think lines 1, 2, and 3 above do? Would you say that line 3 changed text? This is where your confusion lies: text is a variable. When declared (line 1), text is a variable pointing to a String object ("abc") in memory. That "abc" String object is not modified by line 3; instead, line 3 creates a new String object ("ab") and reuses the same variable text to point to it. To reinforce this, note that text and text1 are two different variables pointing to the same object (the same "abc" that was created by line 1).
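A minimal REPL sketch of that point (plain Scala, no Spark needed): after line 3, text1 still prints the original string, which confirms that no object was mutated, only the variable text was reassigned.

var text = "abc"            // line 1: text points to the String object "abc"
var text1 = text            // line 2: text1 points to the same "abc" object
text = text.substring(0, 2) // line 3: a new String "ab" is created; text now points to it

println(text)  // ab  -> the variable was reassigned to the new object
println(text1) // abc -> the original "abc" object is untouched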

Once you see that a variable and the object it may point to are two different things, it's easy to apply this to your RDD example (it's in fact very similar to the String example above):

var a = sc.parallelize(Seq("1", "2", "3")) //like line 1 above
a = a.map(_ + " is a number")              //like line 3 above 

So, the first line creates an RDD object in memory, declares a variable a, and points a at that RDD object. The second line computes a new RDD object (off the first one), but reuses the same variable.

This means that a.map(_ + " is a number") creates a new RDD object from the first one (and the first one is just no longer assigned to a variable because you reused the same variable to point to the derived RDD).
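If you want to keep the original RDD reachable as well, simply use a second variable instead of reusing the same one. A minimal sketch, assuming a running SparkContext sc (as in the Spark shell):

val original = sc.parallelize(Seq("1", "2", "3"))  // first RDD object
val derived  = original.map(_ + " is a number")    // second RDD object, derived from the first

original.foreach(println)  // still prints 1, 2, 3: the original RDD was never mutated
derived.foreach(println)   // prints "1 is a number", "2 is a number", "3 is a number"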

In short, when we say that Spark's RDDs are immutable, we mean that the objects themselves (not the variables pointing to them) cannot be mutated: the object's structure in memory cannot be modified, even though non-final variables pointing to it can be reassigned, just as is the case with String objects.

This is about programming fundamentals, and I'd suggest you go through some analogies on this post: What is the difference between a variable, object, and reference?

ernest_k
  • Ok, I agree with you, but if that is the case, then suppose I did var a = sc.parallelize(Seq("1","2","3")) and then a = a.map(_ + " is a number"). In such a case, if I do a.foreach(println), the latest changed RDD will come out, but then how will we fetch the earlier RDD, the one from before the map function was applied? – Noob_developer Aug 30 '20 at 13:58
  • Not sure exactly how that's different from your earlier example, which I addressed in this answer. If you do `a = a.map(_+" is a number")` then the first RDD you created with `var a= sc.parallelize(Seq("1","2","3"))` is not directly reachable anymore because you overwrote the variable with the derived RDD. If you want access to both the original and the derived RDDs, then you have to use two variables, like `var b = a.map(_+" is a number")`. – ernest_k Aug 30 '20 at 14:17
  • @Noob_developer Your example is the same as if you did `var a = sc.parallelize(Seq("1","2","3")).map(_+" is a number")`. Two RDDs are created, but just the last one is assigned to `a` in the end. – ernest_k Aug 30 '20 at 14:19

Even if you don't explicitly make a var to hold the RDD, the Spark shell will bind the result to some reference anyway, e.g. res0:

scala> sc.textFile("/user/root/data/words.txt")
res0: org.apache.spark.rdd.RDD[String] = /user/root/data/words.txt MapPartitionsRDD[1] at textFile at <console>:25

scala> res0.count()
res1: Long = 9

Spark work mostly consists of transformations and actions, which means you will either be making a new RDD from the data of an existing RDD or retrieving information from it. For that you need some reference, and a val (or var) is just a conveniently named version of that reference.
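A short sketch of that idea (assuming the same words.txt path as above): each transformation hands back a new RDD reference, and a val is simply a convenient name for it, while an action returns a plain value.

val lines = sc.textFile("/user/root/data/words.txt") // RDD[String], one element per line
val words = lines.flatMap(_.split("\\s+"))           // transformation: a new RDD derived from lines
val total = words.count()                            // action: returns a plain Long, not another RDD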