I have a PySpark DataFrame in the following format (a minimal snippet to reproduce it follows the table):
+------------+---------+
| date       | query   |
+------------+---------+
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 1 |
| 2011-08-11 | Query 2 |
| 2011-08-12 | Query 3 |
| 2011-08-12 | Query 3 |
| 2011-08-13 | Query 1 |
+------------+---------+
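For reference, here is a small snippet that reproduces this example DataFrame (just my setup assumption; in reality df is loaded from elsewhere, and I'm assuming an existing SparkSession called spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hand-built copy of the example data above, with plain string dates.
df = spark.createDataFrame(
    [("2011-08-11", "Query 1"),
     ("2011-08-11", "Query 1"),
     ("2011-08-11", "Query 2"),
     ("2011-08-12", "Query 3"),
     ("2011-08-12", "Query 3"),
     ("2011-08-13", "Query 1")],
    ["date", "query"],
)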
I need to transform it so that each unique query becomes its own column, grouped by date, with each cell holding the count of that query on that date. I expect the output to look like this:
+------------+---------+---------+---------+
| date       | Query 1 | Query 2 | Query 3 |
+------------+---------+---------+---------+
| 2011-08-11 | 2       | 1       | 0       |
| 2011-08-12 | 0       | 0       | 2       |
| 2011-08-13 | 1       | 0       | 0       |
+------------+---------+---------+---------+
I am trying to use this answer as an example, but I don't quite understand the code, especially the return statement in the make_row function.
Is there a way to count the queries while transforming the DataFrame? Maybe something like this:
import pyspark.sql.functions as func

grouped = (df.rdd  # drop to the RDD API so map/groupByKey are available
           .map(lambda row: (row.date, (row.query, 1)))  # Just an example. Not sure how to do this.
           .groupByKey())
The DataFrame has potentially hundreds of thousands of rows and queries, so I would prefer an RDD-based approach over the options that use a .collect().
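Roughly what I have in mind is the sketch below: count the (date, query) pairs on the executors with reduceByKey instead of collecting anything to the driver. The part I am missing is how to turn those pairs back into one column per unique query.

# A sketch of the counting step only (my assumption of a reasonable RDD approach);
# it yields ((date, query), count) pairs, not the pivoted columns I need.
counts = (df.rdd
          .map(lambda row: ((row.date, row.query), 1))
          .reduceByKey(lambda a, b: a + b))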
Thank you!