I want to create a table where each row is a unique ID and the Place and City column consists of all the places and cities a person visited , ordered by the date of visit , either using Pyspark or Hive.
df.groupby("ID").agg(F.concat_ws("|",F.collect_list("Place")))
does the concatenation but I am unable to order it by the date. Also for each column I need to keep doing this step separately.
I also tried using windows function as mentioned in this post, (collect_list by preserving order based on another variable) but it trows an error :java.lang.UnsupportedOperationException: 'collect_list(') is not supported in a window operation. I want to :
1- order the concatenated columns in order of the date travelled
2- do this step for multiple columns
Data
| ID | Date | Place | City |
| 1 | 2017 | UK | Birm |
| 2 | 2014 | US | LA |
| 1 | 2018 | SIN | Sin |
| 1 | 2019 | MAL | KL |
| 2 | 2015 | US | SF |
| 3 | 2019 | UK | Lon |
Expected
| ID | Place | City |
| 1 | UK,SIN,MAL | Birm,Sin,KL |
| 2 | US,US | LA,SF |
| 3 | UK | Lon |