I have started studying PySpark and for some reason I'm not able to grasp the resiliency/immutability property of RDDs. My understanding is that an RDD is a data structure, like a DataFrame in pandas, and is immutable. But I wrote the code below and it works.
file = sc.textFile('file name')  # read the CSV as an RDD of lines
# split each line and build (id, value) pairs, casting the value to int so reduceByKey sums numbers
filterData = file.map(lambda x: x.split(',')).map(lambda x: (x[0], int(x[1])))
# add up the values for each id
filterData = filterData.reduceByKey(lambda x, y: x + y)
# sort the (id, total) pairs by the total
filterData = filterData.sortBy(lambda x: x[1])
result = filterData.collect()
Doesn't this violate the immutability property? As you can see, I'm modifying the same RDD again and again.
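For comparison, I could also write the same pipeline giving a new name to every step (just renaming, same logic), which looks like each transformation produces a brand-new RDD:

split = file.map(lambda x: x.split(',')).map(lambda x: (x[0], int(x[1])))
summed = split.reduceByKey(lambda x, y: x + y)
sorted_by_value = summed.sortBy(lambda x: x[1])
result = sorted_by_value.collect()

So is reusing the name filterData actually changing an RDD, or is it just pointing the name at something new each time?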
The file is a CSV with 2 columns: column 1 is an id and column 2 is just some integer. Can you guys please explain where I'm going wrong with my understanding?
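For reference, the file looks roughly like this (made-up values, same shape as my real data):

a1,10
a2,5
a1,3
a2,7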