I am trying to use PySpark to permute a column in a dataframe, i.e. shuffle all values of a single column across rows.
I am trying to avoid the solution where the column is split off, given an index column, and then joined back to the original dataframe (which also gets an added index column). This is mainly because of my understanding (which could be very wrong) that joins are slow on a large dataset (millions of rows).
# What I tried instead, for some dataframe spark_df and a column name colname (a string)
new_df = spark_df.select(colname).sort(colname)
new_df.show()  # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" is not sorted; it has the same row order as spark_df[colname]
Thanks for any guidance in helping me understand this. I am a complete beginner with Spark :)
Edit: Sorry if the question was unclear. I just wanted to replace a column with a sorted version of itself without doing a join. Thank you for pointing out that DataFrames are immutable, but even spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' with the same permutation as 'colname', whereas sorting the column on its own shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (I also changed the title to better reflect the question.)