Filter array column content

Question

I am using PySpark 2.3.1 and would like to filter array elements with an expression rather than a udf:

>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2|           col3|
+----+----+---------------+
|   1|   A|   [1, 2, 3, 4]|
|   2|   B|[1, 2, 3, 4, 5]|
+----+----+---------------+

The expression shown below is wrong. How can I tell Spark to remove any values smaller than 3 from the array in col3? I want something like:

>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
>>> filtered.show()
+----+----+---------+
|col1|col2|   newcol|
+----+----+---------+
|   1|   A|   [3, 4]|
|   2|   B|[3, 4, 5]|
+----+----+---------+

I already have a udf solution, but it is very slow (more than 1 billion data rows):

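from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType
largerThan = F.udf(lambda row, threshold: [x for x in row if x >= threshold], ArrayType(IntegerType()))
df = df.withColumn('newcol', F.size(largerThan(df.queries, F.lit(3))))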

Any help is welcome. Thank you very much in advance.

You can't iterate over an array. You can `explode`, `filter`, and `collect_list` to avoid the `udf`, but this is also an expensive operation. You can also serialize to `rdd`. See [related](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) — pault, Nov 07 '18 at 16:34
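A minimal sketch of the `explode`, `filter`, and `collect_list` route mentioned in the comment above, for Spark versions without the SQL `filter` function. The function name `filter_array_without_udf` and its parameters are hypothetical, chosen only for illustration. It assumes the key columns identify each row uniquely. Rows whose arrays end up empty disappear from the result, and the order of elements after `collect_list` is not guaranteed.

```python
def filter_array_without_udf(df, key_cols, array_col, threshold, out_col="newcol"):
    """Keep array elements >= threshold via explode -> filter -> collect_list.

    Hypothetical helper, not part of PySpark. key_cols must identify a row
    uniquely, otherwise elements of different rows are merged into one array.
    """
    # Imported inside the function so this sketch can be loaded even where
    # pyspark is not installed.
    from pyspark.sql import functions as F

    # One output row per array element; rows with empty arrays drop out here.
    exploded = df.select(*key_cols, F.explode(array_col).alias("element"))

    # Plain row-level filter on the exploded elements.
    kept = exploded.where(F.col("element") >= threshold)

    # Regroup the surviving elements into one array per original row.
    return kept.groupBy(*key_cols).agg(F.collect_list("element").alias(out_col))
```

With the DataFrame from the question this could be called as `filter_array_without_udf(df, ["col1", "col2"], "col3", 3)`. The `groupBy` triggers a shuffle, which is why this route is expensive on large data.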

score 11 · Answer 1 · edited Nov 07 '18 at 18:27

Spark < 2.4

There is no *reasonable replacement for udf in PySpark.

Spark >= 2.4

Your code:

expr("filter(col3, x -> x >= 3)")

can be used as is.
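As a sketch of how that expression might be wired into a DataFrame on Spark 2.4 or later: the helper names `array_filter_expr`, `with_filtered_array`, and `demo` below are hypothetical and exist only for this example; `expr`, `withColumn`, and the SQL `filter` function are the actual Spark pieces.

```python
def array_filter_expr(column, threshold):
    """Build the Spark SQL `filter` expression as a string (hypothetical helper)."""
    # `filter(array, x -> predicate)` is a Spark SQL function from 2.4 on.
    return "filter({}, x -> x >= {})".format(column, threshold)


def with_filtered_array(df, column="col3", threshold=3, out_col="newcol"):
    """Return df plus out_col holding the elements of column >= threshold."""
    # Imported here so the string helper above works without pyspark installed.
    from pyspark.sql import functions as F

    return df.withColumn(out_col, F.expr(array_filter_expr(column, threshold)))


def demo():
    """Rebuild the example from the question; needs a local Spark >= 2.4."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [(1, "A", [1, 2, 3, 4]), (2, "B", [1, 2, 3, 4, 5])],
        ["col1", "col2", "col3"],
    )
    # Drop the original array so only col1, col2 and newcol remain.
    with_filtered_array(df).drop("col3").show()
```

Calling `demo()` on a machine with Spark 2.4 or later runs the example end to end. Because the filtering happens inside the SQL expression, no Python udf is involved.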

Reference

Querying Spark SQL DataFrame with complex types

* Given the cost of exploding or of converting to and from an RDD, a udf is almost always preferable.

edited Nov 07 '18 at 18:27

pault

answered Nov 07 '18 at 17:03

user10619641