How to remove words that have less than three letters in PySpark?

Question

I have a 'text' column that stores arrays of tokens. How can I filter these arrays so that only tokens at least three letters long remain?

from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

columns = ['id', 'text']
vals = [
    (1, ['I', 'am', 'good']),
    (2, ['You', 'are', 'ok']),
]

df = spark.createDataFrame(vals, columns)
df.show()

# I tried this, but it raises TypeError: Column is not iterable
# df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word in col('text')], ''))
# df_clean.show()

I expect to see:

id  |  text  
1   |  [good]
2   |  [You, are]

Possible duplicate of [Filter array column content](https://stackoverflow.com/questions/53193144/filter-array-column-content). TL;DR: there's no easy way to do this (at least in Spark versions 2.3 and below). You can `explode`, `filter`, `groupby`, and `collect_list`, or use a `udf`. — pault, Nov 26 '18 at 17:50
You may also find [this post](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) to be useful. — pault, Nov 26 '18 at 17:54

score 2 · Answer 1 · answered Nov 26 '18 at 17:55

This does it. Whether to drop rows whose array ends up empty is up to you; I added an extra column and then filtered those rows out, but the choice is yours:

from pyspark.sql import functions as f

columns = ['id', 'text']
vals = [
        (1, ['I', 'am', 'good']),
        (2, ['You', 'are', 'ok']),
        (3, ['ok'])
       ]

df = spark.createDataFrame(vals, columns)
#df.show()

df2 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))"))
df2.show()

# This is the actual piece of logic you are looking for.
df3 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))")).where(f.size(f.col("text_left_over")) > 0).drop("text")
df3.show()

returns:

+---+--------------+--------------+
| id|          text|text_left_over|
+---+--------------+--------------+
|  1| [I, am, good]|        [good]|
|  2|[You, are, ok]|    [You, are]|
|  3|          [ok]|            []|
+---+--------------+--------------+

+---+--------------+
| id|text_left_over|
+---+--------------+
|  1|        [good]|
|  2|    [You, are]|
+---+--------------+

thebluephantom


I am in Spark 2.4 on Databricks. Let me check on 2.3 – thebluephantom Nov 26 '18 at 17:59
Sorry, I was on Spark 2.3, Python 2 - I forgot – thebluephantom Nov 26 '18 at 18:00
Databricks is a proprietary extension and has supported higher-order functions for a long time, but [Spark has supported them only since 2.4.0](https://issues.apache.org/jira/browse/SPARK-23909). But what does this add compared to the linked duplicate? – 10465355 Nov 26 '18 at 18:00
So, AFAYK, that is not true – thebluephantom Nov 26 '18 at 18:01
My knowledge and I don't always search for duplicates - have a nice evening – thebluephantom Nov 26 '18 at 18:01
Would anyone care to generalize this as an answer for [my related question](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) with the version caveat provided by @user10465355 – pault Nov 26 '18 at 18:02

score 0 · Accepted Answer · answered Nov 27 '18 at 21:55

This is the solution:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType
filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
df_final_words = df_stemmed.withColumn('words_filtered', filter_length_udf(col('words')))

vndywarhol
