
I can use neither PySpark nor Scala; I can only write SQL code. I have a table with two columns, item_id and name.

item_id  name
1        name1
1        name2
1        name3
2        name4
2        name5

I want to generate a result where the names for each item_id are concatenated:

item_id     names
1           name1-name2-name3
2           name4-name5

How do I create such a table with Spark SQL?
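
For reference, a minimal sketch to reproduce the sample data in Spark SQL (the view name items is just a placeholder, not part of the original setup):

CREATE OR REPLACE TEMPORARY VIEW items AS
SELECT * FROM VALUES
  (1, 'name1'), (1, 'name2'), (1, 'name3'),
  (2, 'name4'), (2, 'name5') AS t(item_id, name);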

Jacek Laskowski
raju
  • Does this answer your question? [pyspark collect_set or collect_list with groupby](https://stackoverflow.com/questions/37580782/pyspark-collect-set-or-collect-list-with-groupby) – yahoo Oct 15 '20 at 07:40

3 Answers


The beauty of Spark SQL is that once you have a solution in any of the supported languages (Scala, Java, Python, R or SQL) you can somewhat figure out other variants.

The following SQL statement seems to do what you ask for:

SELECT item_id, array_join(collect_list(name), '-') as names 
FROM tableName
GROUP BY item_id

In spark-shell it gives the following result:

scala> sql("select item_id, array_join(collect_list(name), '-') as names from so group by item_id").show
+-------+-----------------+
|item_id|            names|
+-------+-----------------+
|      1|name1-name2-name3|
|      2|      name4-name5|
+-------+-----------------+
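
Note that collect_list does not guarantee any particular ordering of the collected names. If a deterministic order within each group is needed (an assumption on my part, not something the question requires), one option is to sort the array before joining:

SELECT item_id, array_join(sort_array(collect_list(name)), '-') AS names
FROM tableName
GROUP BY item_id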
Jacek Laskowski

You can try the below:

from pyspark.sql.functions import array_join, collect_list

(
    df.orderBy('name', ascending=False)
    .groupBy('item_id')
    .agg(
        array_join(
            collect_list('name'),
            delimiter='-',
        ).alias('names')
    )
)
Fahmi

You can use the Spark DataFrame's groupBy and agg methods together with the concat_ws function:

import org.apache.spark.sql.functions.{collect_list, concat_ws}
df.groupBy($"item_id").agg(concat_ws("-", collect_list($"name")).alias("names")).show()

This groups the rows by item_id and aggregates the name values in each group by concatenating them with a dash.
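
Since the question asks for plain SQL, the same logic can also be expressed directly in Spark SQL (a sketch, assuming the DataFrame is registered as a temporary view named tableName):

SELECT item_id, concat_ws('-', collect_list(name)) AS names
FROM tableName
GROUP BY item_id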

Adam