
I have started studying PySpark and for some reason I'm not able to grasp the resiliency property of RDDs. My understanding is that an RDD is a data structure, like a DataFrame in pandas, and is immutable. But I wrote the code shown below and it works.

file = sc.textFile('file name')  # the method is textFile, not textfile
filterData = file.map(lambda x: (x.split(',')[0], int(x.split(',')[1])))  # (id, value) pairs
filterData = filterData.reduceByKey(lambda x, y: x + y)  # sum the values per id
filterData = filterData.sortBy(lambda x: x[1])  # sort by the summed value
result = filterData.collect()  # bring the results back to the driver

Doesn't this violate the immutability property? As you can see, I'm modifying the same RDD again and again.

The file is a CSV with 2 columns: column 1 is an id and column 2 is just some integer. Can you please explain where I'm going wrong in my understanding?
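
For comparison, here is the same pipeline written with a distinct name for each intermediate RDD (the variable names are only for illustration):

raw = sc.textFile('file name')                                      # RDD of text lines
pairs = raw.map(lambda x: (x.split(',')[0], int(x.split(',')[1])))  # a new RDD of (id, value) pairs
totals = pairs.reduceByKey(lambda x, y: x + y)                      # a new RDD with the values summed per id
ranked = totals.sortBy(lambda x: x[1])                              # a new RDD sorted by the summed value
result = ranked.collect()                                           # nothing is computed until this action runs

This version does the same thing; the only difference is that each intermediate RDD gets its own name.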

– fellowCoder
  • You are not actually modifying the same RDD; you are creating a new RDD from the existing one, with the transformations applied, and assigning it to the same variable. – anky Sep 25 '21 at 05:45
  • In that case, what happens to the original RDD? Does it remain in memory? How does it get cleared? – fellowCoder Sep 25 '21 at 06:05
  • 1
    https://stackoverflow.com/questions/23045371/what-happens-to-memory-locations-in-python-when-you-overwrite-a-variable - This should answer your question in comments. Also this question seems related: https://stackoverflow.com/questions/63655617/spark-rdd-immutability-confusion – anky Sep 25 '21 at 06:48
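
A minimal sketch of the rebinding behaviour described in the comments above (the tiny parallelize example is only illustrative, not taken from the question):

rdd1 = sc.parallelize([('a', 1), ('b', 2)])
rdd2 = rdd1.map(lambda kv: (kv[0], kv[1] * 10))   # a new RDD; rdd1 itself is untouched
print(rdd1 is rdd2)                               # False: two distinct objects

before = id(rdd1)
rdd1 = rdd1.map(lambda kv: (kv[0], kv[1] + 1))    # the name rdd1 now points at yet another RDD
print(id(rdd1) == before)                         # False: the name was rebound, no RDD was mutated

Once nothing refers to the earlier RDD object any more, it is cleaned up by ordinary Python garbage collection. Because transformations are lazy, an uncached RDD is only a description of how to compute its data, not the data itself, so there is no large block of memory to clear unless it was explicitly cached or persisted.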

0 Answers