0

I have two dataframes that I am joining together. The left is a list of words (word, tail, id) from a Russian text, the right contains a list of all possible stressed versions of these words. The idea is essentially to get a list of suggestions for each word.

The join results in the following dataframe. The words with id 2 and 5 have more than one suggestion each:

+------------+----+---+----------+------+----+-----------------+-------+
|        word|tail| id|  stressed|  form|type|        word_name|word_id|
+------------+----+---+----------+------+----+-----------------+-------+
|Преступление|    |  0|      null|  null|null|             null|   null|
|           и|    |  1|         и́|Custom|null|                и| 213777|
|   наказание|    |  2| наказа́ние| VinSg|  No|        наказание| 293260|
|   наказание|    |  2| наказа́ние| ImeSg|  No|        наказание| 293260|
|       Роман|    |  3|     Рома́н| ImeSg|  No|            Роман|  46020|
|           в|    |  4|         в|Custom|null|                в| 112374|
|       шести|    |  5|     шести́|   Rod|  Nu|            шесть| 545203|
|       шести|    |  5|     шести́|   Dat|  Nu|            шесть| 545203|
|       шести|    |  5|     шести́|   Pre|  Nu|            шесть| 545203|

What I would like is, rather than outputting an additional line for every recommendation (classic left outer join), to keep the original number of lines with a nested array of objects from the right table (i.e. a 'suggestions' column).

I can get close to this by using the collect_list aggregator, which yields the following results:

+---+------------+----------------------+---------------------+
| id|        word|collect_list(stressed)|collect_list(word_id)|
+---+------------+----------------------+---------------------+
|  0|Преступление|                    []|                   []|
|  1|           и|                   [и́]|             [213777]|
|  2|   наказание|   [наказа́ние, нака...|     [293260, 293260]|
|  3|       Роман|               [Рома́н]|              [46020]|
|  4|           в|                   [в]|             [112374]|
|  5|       шести|    [шести́, шести́, ...| [545203, 545203, ...|

I know that this is possible structurally, but I'm struggling to find an idiomatic way (i.e. without UDFs) of implementing it.

pwwolff
  • 606
  • 1
  • 5
  • 20
  • Possible duplicate of [SPARK SQL replacement for mysql GROUP\_CONCAT aggregate function](https://stackoverflow.com/questions/31640729/spark-sql-replacement-for-mysql-group-concat-aggregate-function) – Masoud R Jul 13 '19 at 09:33
  • I found the answer, but not in this thread – pwwolff Jul 13 '19 at 11:13

1 Answers1

0

After a little bit more research, I found what I was looking for here

The trick is to select the columns that you want to aggregate as a struct before the join:

val dfAccentsGrouped = {
    dfAccentPairs.select($"unstressed", struct("word_id", "stressed", "form", "type"))
}

then, similar to what I had before:

val dfWithSuggestions = {
    dfWords
    .join(dfAccentsGrouped, $"word" === $"unstressed","left_outer").drop("unstressed")
    .groupBy("id", "word", "tail").agg(collect_list("suggestion") as "suggestions")
}

Which then yields exactly the result I was aiming for:

+---+------------+----+------------------------------------------------------------------------------------------+
|id |word        |tail|suggestions                                                                               |
+---+------------+----+------------------------------------------------------------------------------------------+
|0  |Преступление|    |[]                                                                                        |
|1  |и           |    |[[213777, и́, Custom,]]                                                                   |
|2  |наказание   |    |[[293260, наказа́ние, ImeSg, No], [293260, наказа́ние, VinSg, No]]                        |
|3  |Роман       |    |[[46020, Рома́н, ImeSg, No]]                                                              |
|4  |в           |    |[[112374, в, Custom,]]                                                                    |
|5  |шести       |    |[[545203, шести́, Rod, Nu], [545203, шести́, Dat, Nu], [545203, шести́, Pre, Nu]]         |
pwwolff
  • 606
  • 1
  • 5
  • 20