Pyspark substring is not working inside of UDF

Question

I'm trying in vain to use the PySpark substring function inside a UDF. Below is my code snippet:

from pyspark.sql.functions import substring, udf

def my_udf(my_str):
    try:
        my_sub_str = substring(my_str,1, 2)
    except Exception:
        pass
    else:
        return (my_sub_str)

apply_my_udf = udf(my_udf)

df = input_data.withColumn("sub_str", apply_my_udf(input_data.col0))

The sample data is:

ABC1234
DEF2345
GHI3456

But when I print the df, the new column "sub_str" contains no values, as shown below:

[Row(col0='ABC1234', sub_str=None), Row(col0='DEF2345', sub_str=None), Row(col0='GHI3456', sub_str=None)]

Can anyone please let me know what I'm doing wrong?

This is because [you cannot use any of the `pyspark.sql.functions` inside of a `udf`](https://stackoverflow.com/questions/42691899/can-pyspark-sql-function-be-used-in-udf). You also cannot [reference a Spark DataFrame inside a `udf`](https://stackoverflow.com/questions/50123238/pyspark-use-dataframe-inside-udf). Since you have a [naked except](https://stackoverflow.com/questions/14797375/should-i-always-specify-an-exception-type-in-except-statements), you're swallowing the real error message and returning `None`, as that's what Python functions do when there is no `return`. – pault, Feb 06 '20 at 16:54
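To illustrate the comment above, here is a minimal sketch in plain Python (no Spark needed) of why every row ends up as `None`. The function `fake_substring` is a hypothetical stand-in, not a PySpark API: it just raises, to mimic a call that fails when made inside a UDF. The exact error Spark raises may differ.

```python
def fake_substring(value, pos, length):
    # Hypothetical stand-in for a column function called inside a UDF:
    # it always raises, to mimic the call failing on a worker.
    raise RuntimeError("column functions are not usable inside a UDF")


def my_udf(my_str):
    # Same control flow as the function in the question.
    try:
        my_sub_str = fake_substring(my_str, 1, 2)
    except Exception:
        pass  # the error is swallowed here, so nothing is reported
    else:
        return my_sub_str
    # Falling off the end of the function returns None implicitly.


def my_udf_strict(my_str):
    # Without the try/except, the underlying error reaches the caller.
    return fake_substring(my_str, 1, 2)


results = [my_udf(s) for s in ["ABC1234", "DEF2345", "GHI3456"]]
print(results)  # [None, None, None]
```

Calling `my_udf_strict("ABC1234")` instead raises the `RuntimeError`, which is usually the quicker way to find out what actually went wrong.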

score 1 · Answer 1 · answered Feb 06 '20 at 15:33


You don't need a udf to use substring, here's a cleaner and faster way:

>>> from pyspark.sql import functions as f
>>> df.show()
+-------+
|   data|
+-------+
|ABC1234|
|DEF2345|
|GHI3456|
+-------+

>>> df.withColumn("sub_str", f.substring("data", 1, 2)).show()
+-------+-------+
|   data|sub_str|
+-------+-------+
|ABC1234|     AB|
|DEF2345|     DE|
|GHI3456|     GH|
+-------+-------+

Mohamed Ali JAMAOUI



don't use udf-s when you can avoid them +1 – KGS Feb 06 '20 at 19:06

score 1 · Answer 2 · answered Feb 06 '20 at 15:38

If you do need a UDF for this, you could try something like:

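from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Assumes an existing SparkSession named `spark`.
input_data = spark.createDataFrame([
    (1, "ABC1234"),
    (2, "DEF2345"),
    (3, "GHI3456")
], ("id", "col0"))

# Plain Python slicing works inside a UDF; declare the return type explicitly.
udf1 = udf(lambda x: x[0:2], StringType())

# Apply the UDF to the DataFrame created above.
input_data.withColumn('sub_str', udf1('col0')).show()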

+---+-------+-------+
| id|   col0|sub_str|
+---+-------+-------+
|  1|ABC1234|     AB|
|  2|DEF2345|     DE|
|  3|GHI3456|     GH|
+---+-------+-------+

However, as Mohamed Ali JAMAOUI wrote, you can easily do this without a UDF.
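One caveat with the `lambda x: x[0:2]` UDF above: a null in the column reaches the Python function as `None`, and slicing `None` raises a `TypeError`. A minimal sketch of a null-safe variant, written as a plain function so it can be checked without Spark (the name `first_two_chars` is only illustrative):

```python
def first_two_chars(value):
    """Return the first two characters of value, or None for a null input."""
    if value is None:
        return None  # a null column value arrives in a Python UDF as None
    # Python slices are 0-based, so [:2] matches substring(col, 1, 2).
    return value[:2]


samples = ["ABC1234", "DEF2345", "GHI3456", None]
print([first_two_chars(s) for s in samples])  # ['AB', 'DE', 'GH', None]
```

Assuming the imports from the answer above, it could then be wrapped with `udf(first_two_chars, StringType())` and applied with `input_data.withColumn('sub_str', ...)` in the same way as `udf1`.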
