
Spark version: 2.3.0

I have a PySpark dataframe that has an Array column, and I want to filter the array elements by applying some string-matching conditions. E.g., if I had a dataframe like this:

       Array Col
['apple', 'banana', 'orange']
['strawberry', 'raspberry']
['apple', 'pineapple', 'grapes']

I want to filter the elements within each array that contain the string 'apple' or that start with 'app', etc. How can I implement this in PySpark?


2 Answers


You can use higher-order functions from Spark 2.4+:

df.withColumn("Filtered_Col",F.expr(f"filter(Array_Col,x -> x rlike '^(?i)app' )")).show()

+--------------------------+------------+
|Array_Col                 |Filtered_Col|
+--------------------------+------------+
|[apple, banana, orange]   |[apple]     |
|[strawberry, raspberry]   |[]          |
|[apple, pineapple, grapes]|[apple]     |
+--------------------------+------------+
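The question also mentions matching elements that merely contain 'apple'. Since `rlike` does an unanchored regex search, dropping the `^` anchor gives a contains-style filter (note that `pineapple` would then match as well); a small variation on the expression above:

df.withColumn("Filtered_Col", F.expr("filter(Array_Col, x -> x rlike '(?i)apple')")).show(truncate=False)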

For lower versions, you can use a UDF:

import re
from pyspark.sql import functions as F
from pyspark.sql import types as T

def myf(v):
    # keep only the elements that start with 'app' (case-insensitive)
    l = []
    for i in v:
        if re.match('app', i, re.IGNORECASE):
            l.append(i)
    return l

myudf = F.udf(myf, T.ArrayType(T.StringType()))
df.withColumn("Filtered_Col", myudf("Array_Col")).show()
  • Thanks for the answer. But, unfortunately, I can only use Spark 2.3.0 in my organization. :( Do you have an alternate solution for that? I'll edit my question to mention the same – AnonymousMe Oct 01 '21 at 14:27
  • @AnonymousMe updated with another alternative for lower versions. Both will match `Apple` as well as `apple`; remove the `(?i)` if you don't want to match the capital `A` – anky Oct 01 '21 at 14:34
  • @AnonymousMe Also, it looks like you want to filter out rows with 0 matches as well? It's not addressed in the OP, but you can chain a `.where` with a `size` check: `df.withColumn("Filtered_Col",myudf("Array_Col")).where(F.size("Filtered_Col")>0).show()` – anky Oct 01 '21 at 14:56

You can use `filter` in conjunction with `exists`, which is one of the higher-order functions, to check whether any of the elements within the array contains the word.

The other approach would be a UDF -

Data Preparation

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# assumes `spark` is an active SparkSession
sparkDF = spark.createDataFrame([(['apple', 'banana', 'orange'],),
                                 (['strawberry', 'raspberry'],),
                                 (['apple', 'pineapple', 'grapes'],)
                                ],
                                ['arr_column']
                               )

sparkDF.show(truncate=False)

+--------------------------+
|arr_column                |
+--------------------------+
|[apple, banana, orange]   |
|[strawberry, raspberry]   |
|[apple, pineapple, grapes]|
+--------------------------+

Filter & Exists >= Spark 2.4

starts_with_app = lambda s: s.startswith("app")

sparkDF_filtered = sparkDF.filter(F.exists(F.col("arr_column"), starts_with_app))

sparkDF_filtered.show(truncate=False)

+--------------------------+
|arr_column                |
+--------------------------+
|[apple, banana, orange]   |
|[apple, pineapple, grapes]|
+--------------------------+
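Note that the Python wrappers `F.exists` and `F.filter` were only added in PySpark 3.1; on Spark 2.4-3.0 you can fall back to `expr` with the same higher-order SQL functions. A rough equivalent of the snippet above, assuming the same `sparkDF`:

# `exists` via expr works wherever the SQL higher-order functions do (Spark 2.4+)
sparkDF_filtered = sparkDF.filter(F.expr("exists(arr_column, x -> x like 'app%')"))

sparkDF_filtered.show(truncate=False)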

UDF - Lower Versions as well

def filter_string(inp):
    # collect the elements that start with 'app'; return None if nothing matched
    res = []
    for s in inp:
        if s.startswith("app"):
            res.append(s)
    return res if res else None

filter_string_udf = F.udf(filter_string, ArrayType(StringType()))

sparkDF_filtered = sparkDF.withColumn('arr_filtered', filter_string_udf(F.col('arr_column')))\
                          .filter(F.col('arr_filtered').isNotNull())

sparkDF_filtered.show(truncate=False)

+--------------------------+------------+
|arr_column                |arr_filtered|
+--------------------------+------------+
|[apple, banana, orange]   |[apple]     |
|[apple, pineapple, grapes]|[apple]     |
+--------------------------+------------+
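If you would rather keep every row and just get an empty array when nothing matches (closer to what the question asks for), the UDF can return the possibly empty list instead of `None`. A minimal sketch, using a hypothetical `filter_string_keep` helper:

def filter_string_keep(inp):
    # same matching logic, but an empty list is returned instead of None
    return [s for s in inp if s.startswith("app")]

filter_string_keep_udf = F.udf(filter_string_keep, ArrayType(StringType()))

sparkDF.withColumn('arr_filtered', filter_string_keep_udf(F.col('arr_column'))).show(truncate=False)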