
I have a dataset like this one:

  +----------+------------+-----+
  |id        |event       |next |
  +----------+------------+-----+
  | 1        |A           |X    |
  | 2        |B           |Y    |
  | 3        |C           |Z    |
  | 4        |C           |X    |
  | 5        |A           |X    |
  | 6        |D           |Y    |
  | 7        |B           |Y    |
  +----------+------------+-----+

I would like to count how many rows have the same value in both the column "event" and the column "next", add another column with that count, and then keep only one such row per pair, deleting the others.

  
  +----------+------------+-----+-------+
  |id        |event       |next | count |
  +----------+------------+-----+-------+
  | 1        |A           |X    |2      |
  | 2        |B           |Y    |2      |
  | 3        |C           |Z    |1      |
  | 4        |C           |X    |1      |
  | 6        |D           |Y    |1      |
  +----------+------------+-----+-------+

How could I do this in Pyspark? Thank you!

lalala
  • what did you try? Looks fairly simple? Did you check groupby and aggregate as count? – anky Sep 17 '21 at 15:49
  • @anky I know how to count for one column: `df = df.withColumn("count", F.count(df['event']).over(Window.partitionBy("event")))` but not for 2 – lalala Sep 17 '21 at 15:58
  • I don't think you need windowing here, it's just an aggregate, something along the lines of `df.groupBy("event", "next").agg(count("event"), first("id")).show()`; more details are in the duplicate link, and it's similar to [this](https://stackoverflow.com/a/49984522/9840637) – anky Sep 17 '21 at 16:05
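
Following the groupBy suggestion in the last comment, a minimal sketch of what that could look like (data and column names taken from the question; note that `first("id")` does not guarantee which id survives per group unless an explicit ordering is applied):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example data from the question.
df = spark.createDataFrame(
    [(1, "A", "X"), (2, "B", "Y"), (3, "C", "Z"), (4, "C", "X"),
     (5, "A", "X"), (6, "D", "Y"), (7, "B", "Y")],
    ["id", "event", "next"],
)

# Group by both columns at once: the count is the group size,
# and first("id") keeps one representative id per (event, next) pair.
result = (
    df.groupBy("event", "next")
      .agg(F.first("id").alias("id"), F.count("*").alias("count"))
      .select("id", "event", "next", "count")
)
result.show()
```

If the window approach from the earlier comment is preferred instead, `Window.partitionBy("event", "next")` accepts multiple columns, so the same count can be computed per pair and the duplicates dropped afterwards with `dropDuplicates(["event", "next"])`.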

0 Answers