Spark DataFrame how to change permutation of one column without join

Question

I am trying to use PySpark to permute a column in a DataFrame, i.e. shuffle all values of a single column across rows.

I am trying to avoid the solution where the column is split off and given an index column, then joined back to the original DataFrame, which also gets an index column. This is primarily because of my understanding (which could be very wrong) that joins are slow on a large dataset (millions of rows).

# for some DataFrame spark_df and a column name colname
new_df = spark_df.select(colname).sort(colname)
new_df.show()  # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" is not sorted; it has the same permutation as spark_df[colname]

Thanks for any guidance in helping me understand this. I am a complete beginner :)

Edit: Sorry if the question was unclear. I just wanted to replace a column with its sorted version without doing a join. Thank you for pointing out that DataFrames are immutable, but even spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' with the same permutation as 'colname', whereas sorting the column on its own shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (I also changed the title to better reflect the question.)

score 1 · Answer 1 · answered Jun 07 '19 at 06:49

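Spark DataFrames and RDDs are immutable: every transformation creates a new one. Therefore, when you do new_df = spark_df.select(colname).sort(colname), spark_df remains unchanged and only new_df is sorted. In addition, new_df[colname] is a reference to a column, not a list of sorted values, and withColumn evaluates it row by row against the DataFrame it is called on. This is why spark_df.withColumn("ha", new_df[colname]) returns an unsorted DataFrame in which "ha" is simply a copy of colname.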

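Try new_df.withColumn("ha", new_df[colname]) instead. The result is sorted, although it contains only the selected column and its copy "ha", not the other columns of spark_df.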
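
One way to picture why the order does not change is to treat new_df[colname] as a reference to a column rather than a list of already sorted values. The sketch below is a toy model in plain Python, not PySpark: ToyFrame, col, sort and with_column are hypothetical names invented for illustration, and it assumes that withColumn evaluates its expression row by row against the DataFrame it is called on.

```python
# Toy model only: ToyFrame and its methods are hypothetical, not part of PySpark.
class ToyFrame:
    def __init__(self, rows):
        # Each row is a plain dict mapping column name -> value.
        self.rows = [dict(r) for r in rows]

    def col(self, name):
        # A "column" is a recipe (row -> value), not a stored list of values.
        return lambda row: row[name]

    def sort(self, name):
        # Returns a new frame; the original is left untouched (immutability).
        return ToyFrame(sorted(self.rows, key=lambda r: r[name]))

    def with_column(self, new_name, expr):
        # The recipe is applied to each of *this* frame's rows, in this frame's order.
        return ToyFrame({**row, new_name: expr(row)} for row in self.rows)


toy_df = ToyFrame([{"x": 3}, {"x": 1}, {"x": 2}])
toy_sorted = toy_df.sort("x")

# Mirrors spark_df.withColumn("ha", new_df[colname]) from the question.
result = toy_df.with_column("ha", toy_sorted.col("x"))

sorted_values = [r["x"] for r in toy_sorted.rows]
ha_values = [r["ha"] for r in result.rows]
print(sorted_values)  # [1, 2, 3]
print(ha_values)      # [3, 1, 2]
```

In this model the sorted frame really is sorted, but the column reference taken from it carries no row order, so "ha" ends up in the order of the frame that with_column was called on. Under that assumption, lining up values by position would need an explicit key such as an index column, which is the join-based approach mentioned in the question.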
