17

This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.

My understanding is that in Spark, RDDs, DataFrames, and Datasets are all immutable - which, as I understand it, means you cannot change the data. If so, why are we able to edit a DataFrame's existing column using withColumn()?

SCouto
  • 7,808
  • 5
  • 32
  • 49
pallupz
  • 793
  • 3
  • 9
  • 25
  • 3
    I think when you use `withColumn` you actually create a new dataframe rather than modifying the current one. – Ali AzG Nov 19 '18 at 12:11

3 Answers

21

As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable in nature; hence, DataFrames are immutable in nature as well.

Regarding withColumn (or any other operation, for that matter): when you apply such operations on a DataFrame, it will generate a new DataFrame instead of updating the existing one.

However, when you are working with Python, which is a dynamically typed language, you overwrite the value of the previous reference. Hence when you execute a statement like the one below (using `lit` from `pyspark.sql.functions` as an example column expression)

df = df.withColumn("new_col", lit(0))

it will generate another DataFrame and assign it to the reference "df".

To verify this, you can use the id() method of the underlying RDD:

df.rdd.id()

will give you a unique identifier for your DataFrame.
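As a rough analogy in plain Python (this is ordinary Python, not PySpark), rebinding a name to a new immutable object works the same way as `df = df.withColumn(...)` rebinding the name "df":

```python
# Plain-Python analogy: an immutable object plus name rebinding,
# which is what df = df.withColumn(...) does with the name "df".
t = (1, 2, 3)        # tuples are immutable, like DataFrames
t_old = t            # keep a second reference to the original object

t = t + (4,)         # builds a brand-new tuple and rebinds the name "t"

print(t_old)         # (1, 2, 3) -- the original object is untouched
print(t)             # (1, 2, 3, 4)
print(t_old is t)    # False: two distinct objects
```

Nothing about the original tuple changed; the name simply points somewhere new, just as id() on the underlying RDD shows for the DataFrame.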

I hope the above explanation helps.

Regards,

Neeraj

Neeraj Bhadani
  • 2,930
  • 16
  • 26
  • "when you apply such operations on DataFrames it will generate a new data frame" - it's also possible there is no new copy created, and what's created is just a "view" of the original. We can't really tell, as DataFrames are immutable. – flow2k Aug 30 '19 at 21:18
4

You aren't; the documentation explicitly says

Returns a new Dataset by adding a column or replacing the existing column that has the same name.

If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
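A rough plain-Python analogy (not PySpark) of "returns a new Dataset", using an immutable record type, where the hypothetical field names are just for illustration:

```python
from collections import namedtuple

# An immutable record: like withColumn, _replace returns a NEW object;
# any variable still referring to the old record sees no change.
Point = namedtuple("Point", ["x", "y"])

p1 = Point(x=1, y=2)
p2 = p1._replace(y=99)   # new record with y replaced

print(p1.y)  # 2  -- the old reference is unaffected
print(p2.y)  # 99
```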

Alexey Romanov
  • 167,066
  • 35
  • 309
  • 487
  • Doesn't something like this work in PySpark? `dataframe = dataframe.withColumn("col1", when(col("col1") == "val1", "V").otherwise(col("col1")))` Not too sure about the syntax. – pallupz Nov 19 '18 at 12:20
  • 2
    You can reassign the variable, but that doesn't mean the original value changes. Any more than integers are mutable because you can write `i = i + 1`. By contrast, python lists _are_ mutable: https://stackoverflow.com/questions/24292174/are-python-lists-mutable. – Alexey Romanov Nov 19 '18 at 12:27
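The distinction drawn in that comment (rebinding an int vs. mutating a list) can be checked directly in plain Python:

```python
i = 1
j = i
i = i + 1          # rebinds i to a new int object; j is unaffected
print(j)           # 1

xs = [1, 2]
ys = xs            # ys refers to the SAME list object
xs.append(3)       # mutates the list in place
print(ys)          # [1, 2, 3] -- ys sees the change because lists are mutable
```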
1

The core data structure of Spark, the RDD itself, is immutable. This is much like a String in Java, which is immutable as well: when you concatenate a string with another literal, you are not modifying the original string; you are creating a new one altogether. Similarly with a DataFrame or Dataset: whenever you alter the underlying RDD, by either adding a column or dropping one, you are not changing anything in it; instead, you are creating a new Dataset/DataFrame.
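The same string behavior described above for Java also holds in plain Python, and is easy to demonstrate:

```python
s1 = "spark"
s2 = s1 + "!"      # concatenation builds a brand-new string; s1 is untouched

print(s1)          # spark
print(s2)          # spark!
print(s1 is s2)    # False: two distinct objects
```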

adi555
  • 11
  • 2