
I have this pyspark dataframe:

Borough, Neighborhood, Count, Row_Number
B, UES, 5, 1
B, MID, 10, 2
B, UWS, 4, 3
BR, EV, 1, 1
BR, WB, 4, 2 
BR, MID, 5, 3

I want to transform it into something like this:

Borough, Neighborhood_1, Count_1, Neighborhood_2, Count_2, Neighborhood_3, Count_3
B, UES, 5, MID, 10, UWS, 4
BR, EV, 1, WB, 4, MID, 5

This is what I could come up with:

df.groupBy('Borough').pivot('Neighborhood')

But I keep getting an error. Can anyone point out the mistake?

russel

1 Answer


This is how you can do it:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df.withColumn("row_num", F.row_number().over(Window.partitionBy("Borough").orderBy("Count")))
df = df.withColumn("neighbour_dummy", F.concat_ws("_", F.lit("Neighborhood"), F.col("row_num")))

pivot_neigh = df.groupBy("Borough").pivot("neighbour_dummy").agg(F.first("Neighborhood"))

Do the same for the Count column and then join the two datasets. You can refer here to learn how Window, concat_ws, etc. work: https://www.youtube.com/playlist?list=PLI57HEydB_p7TogaXyArlE1ZNw4sgHmsd

Prateek Jain
  • you can't put row_number like that, it should be `F.row_number().over(Window.partitionBy("Borough").orderBy("Count"))` – murtihash Apr 27 '20 at 16:03
  • Prateek Jain, thank you, but please refer to the problem: I already have a row number, is there any way to re-use it? – russel Apr 27 '20 at 16:10
  • Where is row_number??? Is Count column the one you're referring to? If yes, then in my above code use "Count" instead of "row_num" – Prateek Jain Apr 27 '20 at 16:23
  • he added Row_Number recently. you could use something like `df.withColumn("neighbour_dummy", F.concat_ws("_", F.lit("Neighbourhood"), F.col("Row_Number"))) .groupBy("Borough").pivot("neighbour_dummy").agg(F.first("Neighborhood").alias(""), F.first("count").alias("count")).show()`. prateeks answer should be accepted solution – murtihash Apr 27 '20 at 16:29