
I have a code similar to this:

import re

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def regex_filter(x):
    regexs = ['.*123.*']

    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True

    return False 


filter_udf = udf(regex_filter, BooleanType())

df_filtered = df.filter(filter_udf(df.fieldXX))

I want to use the regexs variable to check whether the digits "123" appear anywhere in "fieldXX".

I don't know what I did wrong! Could anyone help me with this?
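
For reference, a minimal reproducible sketch of what I am running (the SparkSession setup and sample values are made up, not my real data; only the first row should pass the filter):

import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# toy data: only "abc123def" contains the substring "123"
df = spark.createDataFrame(
    [("abc123def",), ("no digits here",), (None,)],
    ["fieldXX"],
)

def regex_filter(x):
    regexs = ['.*123.*']
    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False

filter_udf = udf(regex_filter, BooleanType())
df_filtered = df.filter(filter_udf(df.fieldXX))
df_filtered.show()  # expected: a single row, "abc123def"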

T.Doe
    We can't tell what you did wrong unless we have some sample input and output. Are you getting an error? The wrong answer? Read [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to provide a [mcve]. – pault Aug 06 '18 at 14:43

2 Answers


The regexp is incorrect.

I think it should be something like:

regexs = ['.*[123].*']

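As a quick standalone check (plain re, outside Spark): note that [123] is a character class, so this pattern matches any value containing at least one of the digits 1, 2 or 3, not only the literal sequence "123".

import re

pattern = '.*[123].*'
for s in ['abc123def', 'only a 2 here', 'no match']:
    print(s, bool(re.match(pattern, s, re.IGNORECASE)))
# abc123def True, only a 2 here True, no match False
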
vvg

You can use a SQL expression to achieve this:

df.createOrReplaceTempView("df_temp")
df_1 = spark.sql("select *, case when col1 like '%123%' then 'TRUE' else 'FALSE' end col2 from df_temp")
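
If the goal is only to keep the matching rows (as with df.filter in the question), a where clause on the same temp view works too; a sketch reusing the view registered above and the question's column name fieldXX:

df_filtered = spark.sql("select * from df_temp where fieldXX like '%123%'")
# or with a regular expression instead of LIKE:
# df_filtered = spark.sql("select * from df_temp where fieldXX rlike '123'")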

The disadvantage of using a UDF is that you cannot save the data frame back or do any further manipulations on that data frame.

Arun Gunalan
  • *The disadvantage of using a UDF is that you cannot save the data frame back or do any further manipulations on that data frame*? Would you care to clarify that statement? – pault Aug 06 '18 at 16:23
  • After using the UDF you get the result in a data frame df_1; you cannot save it back to HDFS or use df_1 for further manipulation, it will throw an error – Arun Gunalan Aug 06 '18 at 16:26
  • That is absolutely not true. Please show me some documentation to support this claim. – pault Aug 06 '18 at 16:28
  • I experienced this when I was using a UDF, you can try it for yourself – Arun Gunalan Aug 06 '18 at 16:35