Pyspark groupby then sort within group

Question

I have a table which contains id, offset, text. Suppose input:

id offset text
1  1      hello 
1  7      world
2  1      foo

I want output like:

id text
1  hello world
2  foo

I'm using:

df.groupby(id).agg(concat_ws("",collect_list(text))

But I don't know how to ensure the order in the text. I did sort before groupby the data, but I've heard that groupby might shuffle the data. Is there a way to do sort within group after groupby data?

Does this answer your question? [collect\_list by preserving order based on another variable](https://stackoverflow.com/questions/46580253/collect-list-by-preserving-order-based-on-another-variable) — Neil, May 12 '23 at 17:16

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

this will create a required df:

df1 = sqlContext.createDataFrame([("1", "1","hello"), ("1", "7","world"), ("2", "1","foo")], ("id", "offset" ,"text" ))
display(df1)

then you can use the following code, could be optimized further:

@udf
def sort_by_offset(col):
  result =""
  text_list = col.split("-")
  for i in range(len(text_list)):
    text_list[i] = text_list[i].split(" ")
    text_list[i][0]=int(text_list[i][0])
  text_list = sorted(text_list, key=lambda x: x[0], reverse=False)
  for i in range(len(text_list)):
    result = result+ " " +text_list[i][1]
  return result.lstrip()
df2 = df1.withColumn("offset_text",concat(col("offset"),lit(" "),col("text")))
df3 = df2.groupby(col("id")).agg(concat_ws("-",collect_list(col("offset_text"))).alias("offset_text"))
df4 = df3.withColumn("text",sort_by_offset(col("offset_text")))
display(df4)

Final Output:

score 0 · Answer 2 · answered Aug 16 '17 at 16:35

0

Add sort_array:

from pyspark.sql.functions import sort_array

df.groupby(id).agg(concat_ws("", sort_array(collect_list(text))))

answered Aug 16 '17 at 16:35

user8473752

1

This won't work, my sort key is offset, not string itself. – Yunfei Lu Aug 17 '17 at 20:57

Pyspark groupby then sort within group

2 Answers2