I have two dataframes that I am joining together. The left is a list of words (word, tail, id) from a Russian text, the right contains a list of all possible stressed versions of these words. The idea is essentially to get a list of suggestions for each word.
The join results in the following dataframe. The words with id 2 and 5 have more than one suggestion each:
+------------+----+---+----------+------+----+-----------------+-------+
| word|tail| id| stressed| form|type| word_name|word_id|
+------------+----+---+----------+------+----+-----------------+-------+
|Преступление| | 0| null| null|null| null| null|
| и| | 1| и́|Custom|null| и| 213777|
| наказание| | 2| наказа́ние| VinSg| No| наказание| 293260|
| наказание| | 2| наказа́ние| ImeSg| No| наказание| 293260|
| Роман| | 3| Рома́н| ImeSg| No| Роман| 46020|
| в| | 4| в|Custom|null| в| 112374|
| шести| | 5| шести́| Rod| Nu| шесть| 545203|
| шести| | 5| шести́| Dat| Nu| шесть| 545203|
| шести| | 5| шести́| Pre| Nu| шесть| 545203|
What I would like is, rather than outputting an additional line for every recommendation (classic left outer join), to keep the original number of lines with a nested array of objects from the right table (i.e. a 'suggestions' column).
I can get close to this by using the collect_list aggregator, which yields the following results:
+---+------------+----------------------+---------------------+
| id| word|collect_list(stressed)|collect_list(word_id)|
+---+------------+----------------------+---------------------+
| 0|Преступление| []| []|
| 1| и| [и́]| [213777]|
| 2| наказание| [наказа́ние, нака...| [293260, 293260]|
| 3| Роман| [Рома́н]| [46020]|
| 4| в| [в]| [112374]|
| 5| шести| [шести́, шести́, ...| [545203, 545203, ...|
I know that this is possible structurally, but I'm struggling to find an idiomatic way (i.e. without UDFs) of implementing it.