
I have a code similar to this:

import re

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def regex_filter(x):
    regexs = ['.*123.*']

    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True

    return False 


filter_udf = udf(regex_filter, BooleanType())

df_filtered = df.filter(filter_udf(df.fieldXX))

I want to use the regexs variable to check whether the digits "123" appear anywhere in "fieldXX".

I don't know what I did wrong! Could anyone help me with this?
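
For reference, a minimal reproducible sketch of what I am running (the SparkSession setup and sample values are made up, not my real data; only the first row should pass the filter):

import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# toy data: only "abc123def" contains the substring "123"
df = spark.createDataFrame(
    [("abc123def",), ("no digits here",), (None,)],
    ["fieldXX"],
)

def regex_filter(x):
    regexs = ['.*123.*']
    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False

filter_udf = udf(regex_filter, BooleanType())
df_filtered = df.filter(filter_udf(df.fieldXX))
df_filtered.show()  # expected: a single row, "abc123def"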

T.Doe
    We can't tell what you did wrong unless we have some sample input and output. Are you getting an error? The wrong answer? Read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to provide a [mcve]. – pault Aug 06 '18 at 14:43

2 Answers


The regexp is incorrect.

I think it should be something like:

regexs = ['.*[123].*']

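As a quick standalone check (plain re, outside Spark): note that [123] is a character class, so this pattern matches any value containing at least one of the digits 1, 2 or 3, not only the literal sequence "123".

import re

pattern = '.*[123].*'
for s in ['abc123def', 'only a 2 here', 'no match']:
    print(s, bool(re.match(pattern, s, re.IGNORECASE)))
# abc123def True, only a 2 here True, no match False
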
vvg

You can use a SQL expression to achieve this:

df.createOrReplaceTempView("df_temp")
df_1 = spark.sql("select *, case when col1 like '%123%' then 'TRUE' else 'FALSE' end col2 from df_temp")
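
If the goal is only to keep the matching rows (as with df.filter in the question), a where clause on the same temp view works too; a sketch reusing the view registered above and the question's column name fieldXX:

df_filtered = spark.sql("select * from df_temp where fieldXX like '%123%'")
# or with a regular expression instead of LIKE:
# df_filtered = spark.sql("select * from df_temp where fieldXX rlike '123'")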

The disadvantage of using a UDF is that you cannot save the data frame back or do any further manipulations on that data frame.

Arun Gunalan
  • *The disadvantage of using a UDF is that you cannot save the data frame back or do any further manipulations on that data frame*? Would you care to clarify that statement? – pault Aug 06 '18 at 16:23
  • After using the UDF you get the result in a data frame df_1; you cannot save it back to HDFS or use df_1 for further manipulation, it will throw an error – Arun Gunalan Aug 06 '18 at 16:26
  • That is absolutely not true. Please show me some documentation to support this claim. – pault Aug 06 '18 at 16:28
  • I experienced this when I was using a UDF, you can try it for yourself – Arun Gunalan Aug 06 '18 at 16:35