UDF that sorts a list in PySpark

Question

I have a DataFrame in which one of the columns, called stopped, looks like this:

+--------------------+
|             stopped|
+--------------------+
|[nintendo, dsi, l...|
|[nintendo, dsi, l...|
|    [xl, honda, 500]|
|[black, swan, green]|
|[black, swan, green]|
|[pin, stripe, sui...|
|  [shooting, braces]|
|      [haus, geltow]|
|[60, cm, electric...|
|  [yamaha, yl1, yl2]|
|[landwirtschaft, ...|
|     [wingbar, 9581]|
|       [gummi, 16mm]|
|[brillen, lupe, c...|
|[man, city, v, ba...|
|[one, plus, one, ...|
|     [kapplocheisen]|
|[tractor, door, m...|
|[pro, nano, flat,...|
|[kaleidoscope, to...|
+--------------------+

I would like to create another column that contains the same list, but with the keywords sorted.

As I understand it, I need to create a udf that takes and returns a list:

udf_sort = udf(lambda x: x.sort(), ArrayType(StringType()))
ps_clean.select("*", udf_sort(ps_clean["stopped"])).show(5, False)

and I get:

+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
|client_id|kw_id     |keyword              |max_click_dt|tokenized                 |stopped                   |<lambda>(stopped)|
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
|710      |4304414582|nintendo dsi lite new|2017-01-06  |[nintendo, dsi, lite, new]|[nintendo, dsi, lite, new]|null             |
|705      |4304414582|nintendo dsi lite new|2017-03-25  |[nintendo, dsi, lite, new]|[nintendo, dsi, lite, new]|null             |
|707      |647507047 |xl honda 500 s       |2016-10-26  |[xl, honda, 500, s]       |[xl, honda, 500]          |null             |
|710      |26308464  |black swan green     |2016-01-01  |[black, swan, green]      |[black, swan, green]      |null             |
|705      |26308464  |black swan green     |2016-07-13  |[black, swan, green]      |[black, swan, green]      |null             |
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+

Why is the sorting not being applied?

score 2 · Accepted Answer · answered Jul 03 '17 at 15:17

x.sort() sorts the list in place and returns None (the in-place change only affects the Python copy of the list passed to the udf, not the DataFrame column). Because the lambda returns None for every row, your column labeled <lambda>(stopped) has all null values. sorted(x) sorts the list and returns a new sorted copy. So, replacing your udf with

udf_sort = udf(lambda x: sorted(x), ArrayType(StringType()))

should solve your problem.
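The return values can be checked in plain Python, without Spark. This is a minimal sketch: the two lambdas are the ones passed to udf in the question and in the fix above, while sort_tokens is a hypothetical helper (not part of the original answer) that also passes None through, on the assumption that the column may contain null rows, for which sorted(None) would raise a TypeError.

```python
# The lambda from the question: list.sort() sorts in place and returns None.
broken = lambda x: x.sort()

# The lambda from the fix: sorted() returns a new, sorted list.
fixed = lambda x: sorted(x)


def sort_tokens(tokens):
    """Hypothetical null-safe variant: None stays None, lists are sorted."""
    return None if tokens is None else sorted(tokens)


row = ["nintendo", "dsi", "lite", "new"]
print(broken(list(row)))  # None
print(fixed(row))         # ['dsi', 'lite', 'new', 'nintendo']
print(sort_tokens(None))  # None
```

If null rows are a concern, sort_tokens could be wrapped the same way as the lambda: udf(sort_tokens, ArrayType(StringType())).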

Alternatively, you can use the built-in function sort_array instead of defining your own udf.

from pyspark.sql.functions import sort_array

ps_clean.select("*", sort_array(ps_clean["stopped"])).show(5, False)

This method is a little cleaner, and you can expect some performance gains: sort_array runs inside Spark itself, so the column data doesn't have to be serialized to a Python worker and back the way it is for a udf.

score 1 · Answer 2 · answered Jul 03 '17 at 15:15

Change your udf to:

udf_sort = udf(lambda x: sorted(x), ArrayType(StringType()))

For the differences between list.sort() and sorted(), read:

What is the difference between `sorted(list)` vs `list.sort()` ? python
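In short, and as a small sketch rather than a substitute for the linked question: sorted() leaves its argument untouched and returns a new list, while list.sort() reorders the list itself and returns None. The variable names below are illustrative only.

```python
tokens = ["xl", "honda", "500"]

# sorted() builds a new list; the original keeps its order.
new_list = sorted(tokens)
print(tokens)    # ['xl', 'honda', '500']
print(new_list)  # ['500', 'honda', 'xl']

# list.sort() reorders the list in place and returns None.
result = tokens.sort()
print(result)    # None
print(tokens)    # ['500', 'honda', 'xl']
```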

Konrad Kostrzewa
