I have a dataframe as below:

+--------------------+--------------------+
|                 _id|           statement|
+--------------------+--------------------+
|                   1|            ssssssss|
|                   2|            ssssssss|
|                   3|            aaaaaaaa|
|                   4|            aaaaaaaa|
+--------------------+--------------------+

After using df.dropDuplicates(['statement']), I got this:

+--------------------+--------------------+
|                 _id|           statement|
+--------------------+--------------------+
|                   1|            ssssssss|
|                   3|            aaaaaaaa|
+--------------------+--------------------+

But what I actually want is to keep all the _id values for each statement, as below:

+--------------------+--------------------+
|                 _id|           statement|
+--------------------+--------------------+
|                1, 2|            ssssssss|
|                3, 4|            aaaaaaaa|
+--------------------+--------------------+

How can I do this?

1 Answer

I finally found my answer in the question "combine text from multiple rows in pyspark": group by statement and collect the _id values into a list.

from pyspark.sql import functions as F

df.groupBy('statement').agg(F.collect_list('_id').alias("_id")).show()