
Can anyone please advise on a Spark SQL query that combines multiple rows into one row per unique value of column1, with the column2 values ordered by date ascending? Below is the data.

This is the enrollment table, which holds data in this form:

column1  column2     timeStamp

abc      enrolled    2022/09/01
abc      changed     2022/09/02
abc      registered  2022/09/04
abc      blocked     2022/09/05
abc      left        2022/09/06
def      enrolled    2022/09/20
def      changed     2022/09/21
def      changed     2022/09/21
def      changed     2022/09/24
def      left        2022/09/25
ghi      registered  2022/09/01
ghi      changed     2022/09/02
ghi      left        2022/09/03
ghi      returned    2022/10/03

The output of the query should look like this:

out_column1  out_column2

abc          enrolled-changed-registered-blocked-left
def          enrolled-changed-changed-left
ghi          registered-changed-left-returned

The group_concat function is not available in PySpark.
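
For reference, a minimal PySpark sketch of the intended transformation (assuming Spark 3.1+ for F.transform; the DataFrame and column names follow the table above). It collects (timeStamp, column2) structs per key, sorts them, and joins the values with '-':

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# The enrollment data from the table above.
data = [
    ("abc", "enrolled",   "2022/09/01"),
    ("abc", "changed",    "2022/09/02"),
    ("abc", "registered", "2022/09/04"),
    ("abc", "blocked",    "2022/09/05"),
    ("abc", "left",       "2022/09/06"),
    ("def", "enrolled",   "2022/09/20"),
    ("def", "changed",    "2022/09/21"),
    ("def", "changed",    "2022/09/21"),
    ("def", "changed",    "2022/09/24"),
    ("def", "left",       "2022/09/25"),
    ("ghi", "registered", "2022/09/01"),
    ("ghi", "changed",    "2022/09/02"),
    ("ghi", "left",       "2022/09/03"),
    ("ghi", "returned",   "2022/10/03"),
]
df = spark.createDataFrame(data, ["column1", "column2", "timeStamp"])

# array_sort orders the structs by their first field (timeStamp),
# transform then keeps only column2, and concat_ws joins with '-'.
result = (
    df.groupBy("column1")
      .agg(
          F.concat_ws(
              "-",
              F.transform(
                  F.array_sort(F.collect_list(F.struct("timeStamp", "column2"))),
                  lambda x: x["column2"],
              ),
          ).alias("out_column2")
      )
      .withColumnRenamed("column1", "out_column1")
)

result.show(truncate=False)

The zero-padded yyyy/MM/dd strings sort correctly lexicographically, so no cast to a date type is needed here.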

hellopa1
If you tagged the question incorrectly and thus the answers you received are not usable for you, then I suggest you delete the question and re-ask it with the correct tags this time. The reason is that the answers (and the duplicate closure) were provided in good faith in line with the question and its tags. It is unfair to subject the answerers to potential downvotes just because their answers are no longer correct after the changes you made to the question. – Shadow Oct 19 '22 at 20:35

1 Answer

-- Note: group_concat(... ORDER BY ... SEPARATOR ...) is MySQL/MariaDB
-- syntax; it does not exist in Spark SQL (see the Spark equivalent below).
select   column1                                                 as out_column1
        ,group_concat(column2 order by timeStamp separator '-')  as out_column2
from     t
group by column1

Output:

out_column1  out_column2
abc          enrolled-changed-registered-blocked-left
def          enrolled-changed-changed-changed-left
ghi          registered-changed-left-returned
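
Since the group_concat query above will not run on Spark, here is a comparable Spark SQL query as a sketch (assuming the enrollment data is registered as a temp view named t, and Spark 2.4+ for array_sort and transform):

# Spark SQL equivalent of the MySQL query above, assuming the data
# has been exposed as a temp view, e.g. df.createOrReplaceTempView("t").
spark.sql("""
    SELECT column1 AS out_column1,
           concat_ws('-',
               transform(
                   array_sort(collect_list(struct(timeStamp, column2))),
                   x -> x.column2)) AS out_column2
    FROM t
    GROUP BY column1
""").show(truncate=False)

Sorting an array of structs with array_sort orders by the first struct field (timeStamp), which stands in for the ORDER BY clause inside group_concat.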


DannySlor