
I am doing data analysis with PySpark. I'm trying to get the count of occurrences of each unique row and then join that count back to the original data frame, so that the data frame is no longer aggregated but every row carries the number of times it occurs in the data frame. It would seem to me that the appropriate way to do this would be:

df.join(df.groupBy(df.columns).count(), df.columns, 'left')

However, upon inspection, this results in NULLs for the count column. Perhaps I am doing the wrong type of join? Any thoughts?
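
For illustration, here is a toy version of what I'm trying to do (the column names and data are made up; my real data is larger and sensitive):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up data standing in for the real data frame.
df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("b", 2)],
    ["col1", "col2"],
)

# Count occurrences of each unique row, then join the counts back
# onto the unaggregated data frame.
result = df.join(df.groupBy(df.columns).count(), df.columns, "left")
result.show()
# What I expect (but on my real data the count column comes back NULL):
# +----+----+-----+
# |col1|col2|count|
# +----+----+-----+
# |   a|   1|    2|
# |   a|   1|    2|
# |   b|   2|    1|
# +----+----+-----+
```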

Chris C
  • I'm unable to reproduce; I just tried this on a sample dataframe and your code worked for me. Can you try to provide a [mcve]? See this post on [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Jul 23 '18 at 15:55
  • 1
  • As an aside, you can also achieve the same output using a `pyspark.sql.Window`. First `import pyspark.sql.functions as f` and `from pyspark.sql import Window`, then try `df.select("*", f.count("*").over(Window.partitionBy(df.columns)).alias('count'))` – pault Jul 23 '18 at 15:59
  • Same here, unable to reproduce. Statement provided by OP works fine. – Sailesh Kotha Jul 23 '18 at 18:02
  • Doesn't work in my case, and I can't provide an MCVE because I'm processing a lot of sensitive data and I'm not even sure what the problem is in the first place, or I wouldn't be posting here. However, the solution provided by @pault solves my problem. Another solution is to create a `monotonically_increasing_id` primary key, group by with an aggregate count in which you `collect_list` the PK, and then `explode` the listed PK and reorder by the PK (sketches of both approaches follow below). – Chris C Jul 23 '18 at 18:15
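
A rough sketch of the window-function approach pault suggests above (assuming `df` is the original data frame):

```python
import pyspark.sql.functions as f
from pyspark.sql import Window

# Partition a window over every column, so each partition holds the
# duplicates of one unique row, and count the rows in each partition.
w = Window.partitionBy(df.columns)
df_with_counts = df.select("*", f.count("*").over(w).alias("count"))
```

And a sketch of the primary-key workaround described in the last comment (the helper column names `pk` and `pks` are chosen here purely for illustration):

```python
import pyspark.sql.functions as f

# 1. Tag each row with a surrogate primary key.
df_pk = df.withColumn("pk", f.monotonically_increasing_id())

# 2. Aggregate: count each unique row and collect the PKs of its duplicates.
agg = df_pk.groupBy(df.columns).agg(
    f.count("*").alias("count"),
    f.collect_list("pk").alias("pks"),
)

# 3. Explode the PK list to get one row back per original row,
#    then restore the original order and drop the helper columns.
restored = (
    agg.withColumn("pk", f.explode("pks"))
       .drop("pks")
       .orderBy("pk")
       .drop("pk")
)
```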
