I am doing data analysis with PySpark. I'm trying to compute the count of each distinct row and then join that count back onto the original data frame, so that the result is no longer aggregated but each row carries the number of times it occurs in the data frame. It would seem to me that the appropriate way to do this would be:
df.join(df.groupBy(df.columns).count(), df.columns, 'left')
However, upon inspection, this results in NULLs for the count column. Perhaps I am doing the wrong type of join? Any thoughts?
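
For reference, here is a minimal, runnable sketch of what I'm attempting; the data and column names (a, b) are placeholders, not my real schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for my data frame; one row is duplicated
df = spark.createDataFrame(
    [(1, "x"), (1, "x"), (2, "y")],
    ["a", "b"],
)

# Count occurrences of each distinct row, then join the counts
# back so the data frame keeps its original row-level granularity
counts = df.groupBy(df.columns).count()
result = df.join(counts, df.columns, "left")
result.show()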