I am currently preparing for my job interview as a Data Engineer. I am stuck with confusion. Below are the details.
If Spark RDDs are immutable by nature then why are we able to create spark RDDs with var?
Your confusion has little to do with Spark's RDDs. It will help to understand the difference between a variable and an object. A more familiar example:
Suppose you have a String, which we all know is an immutable type:
var text = "abc" //1
var text1 = text //2
text = text.substring(0,2) //3
Like Spark's RDDs, Strings are immutable. But what do you think lines 1, 2, and 3 above do? Would you say that line 3 changed text? This is where your confusion is: text is a variable. When declared (line 1), text is a variable pointing to a String object ("abc") in memory. That "abc" String object is not modified by line 3; instead, line 3 creates a new String object ("ab") and reuses the same variable text to point to it. To reinforce this, note that after line 2, text and text1 are two different variables pointing to the same object (the same "abc" that was created by line 1), and after line 3, text1 still points to that original, unmodified "abc".
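You can verify this yourself: running the three lines and then printing both variables shows that reassigning text leaves the original object, still reachable through text1, untouched (a minimal script-style sketch):

```scala
var text = "abc"            // line 1: text points to the String "abc"
var text1 = text            // line 2: text1 points to the SAME "abc" object
text = text.substring(0, 2) // line 3: creates a NEW String "ab"; text is reassigned

println(text)  // ab  -- text now points to the new object
println(text1) // abc -- text1 still points to the original, which was never mutated
```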
Once you see that a variable and the object it points to are two different things, it's easy to apply this to your RDD example (it is in fact very similar to the String example above):
var a = sc.parallelize(Seq("1", "2", "3")) //like line 1 above
a = a.map(_ + " is a number") //like line 3 above
So, the first line creates an RDD object in memory, declares a variable a, and makes a point to that RDD object. The second line computes a new RDD object (off the first one) but reuses the same variable: a.map(_ + " is a number") creates a new RDD object from the first one, and the first RDD is simply no longer reachable through a, because you reused that variable to point to the derived RDD.
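The same pattern holds for any immutable collection, so it can be demonstrated without a Spark cluster. Here is a sketch that uses a plain immutable Seq as a stand-in for the RDD; like RDD.map, Seq.map returns a brand-new object:

```scala
var a = Seq("1", "2", "3")    // stand-in for sc.parallelize(Seq("1", "2", "3"))
val original = a              // keep a second reference to the first object
a = a.map(_ + " is a number") // map returns a NEW collection; a is reassigned

println(a)        // the new, derived collection
println(original) // still ("1", "2", "3") -- the first object was never mutated
```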
In short: when we say that Spark's RDDs are immutable, we mean that the objects themselves (not the variables pointing to them) cannot be mutated; an RDD's structure in memory cannot be modified. The non-final variables pointing to them can still be reassigned, just as is the case with String objects.
This is about programming fundamentals, and I'd suggest you go through some analogies on this post: What is the difference between a variable, object, and reference?
Even if you don't declare a var explicitly to hold the RDD, the Spark shell will bind the result to some reference anyway, e.g. res0:
scala> sc.textFile("/user/root/data/words.txt")
res0: org.apache.spark.rdd.RDD[String] = /user/root/data/words.txt MapPartitionsRDD[1] at textFile at <console>:25
scala> res0.count()
res1: Long = 9
Spark work mostly consists of transformations and actions: either you make a new RDD from the data in an existing one, or you retrieve information from it. For that you need some reference to the RDD, and a val is simply that conveniently named reference.
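A val reference is enough for both kinds of operation precisely because neither one modifies the RDD it is called on. The sketch below mirrors this with an immutable Seq standing in for an RDD[String]: the "transformation" returns a new object and the "action" returns a value, while the original is left unchanged:

```scala
val words = Seq("spark", "rdd", "immutable") // stand-in for an RDD[String]
val upper = words.map(_.toUpperCase)         // "transformation": yields a NEW collection
val n     = words.size                       // "action": retrieves info, returns a value

println(upper)
println(n)
println(words) // the original is untouched; a val reference never blocked anything
```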