
I have this pyspark dataframe:

Borough, Neighborhood, Count, Row_Number
B, UES, 5, 1
B, MID, 10, 2
B, UWS, 4, 3
BR, EV, 1, 1
BR, WB, 4, 2 
BR, MID, 5, 3

I want to transform it into something like this:

Borough, Neighborhood_1, Count_1, Neighborhood_2, Count_2, Neighborhood_3, Count_3
B, UES, 5, MID, 10, UWS, 4
BR, EV, 1, WB, 4, MID, 5

This is what I could come up with:

df.groupBy('Borough').pivot('Neighborhood')

But I keep getting an error. Can anyone point out the mistake?

russel

1 Answer


This is how you can do it:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df.withColumn("row_num", F.row_number().over(Window.partitionBy("Borough").orderBy("Count")))
df = df.withColumn("neighbour_dummy", F.concat_ws("_", F.lit("Neighborhood"), F.col("row_num")))

pivot_neigh = df.groupBy("Borough").pivot("neighbour_dummy").agg(F.first("Neighborhood"))

Do the same for the Count column and then join the two datasets. You can refer here to learn how Window, concat_ws, etc. work: https://www.youtube.com/playlist?list=PLI57HEydB_p7TogaXyArlE1ZNw4sgHmsd

Prateek Jain
  • you can't put row_number like that, it should be `F.row_number().over(Window.partitionBy("Borough").orderBy("Count"))` – murtihash Apr 27 '20 at 16:03
  • Prateek Jain, thank you, but please refer to the problem: I already have a row number, is there any way to re-use it? – russel Apr 27 '20 at 16:10
  • Where is row_number??? Is Count column the one you're referring to? If yes, then in my above code use "Count" instead of "row_num" – Prateek Jain Apr 27 '20 at 16:23
  • he added Row_Number recently. you could use something like `df.withColumn("neighbour_dummy", F.concat_ws("_", F.lit("Neighbourhood"), F.col("Row_Number"))) .groupBy("Borough").pivot("neighbour_dummy").agg(F.first("Neighborhood").alias(""), F.first("count").alias("count")).show()`. prateeks answer should be accepted solution – murtihash Apr 27 '20 at 16:29