
I have a DataFrame with many columns of str type, and I want to apply a function to all of those columns without renaming them or adding new columns. I tried using a for-in loop that calls withColumn (see the example below), but when I run the code it usually throws a StackOverflowError (it rarely works). The DataFrame is not big at all; it has only ~15,000 records.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# df is a DataFrame
def lowerCase(string):
    return string.strip().lower()

lowerCaseUDF = udf(lowerCase, StringType())

# Replace every string column with a trimmed, lower-cased version of itself
for (columnName, kind) in df.dtypes:
    if kind == "string":
        df = df.withColumn(columnName, lowerCaseUDF(df[columnName]))

df.select("Tipo_unidad").distinct().show()

The complete error is very long, so I will paste only a few lines here. You can find the full trace here: Complete Trace

Py4JJavaError: An error occurred while calling o516.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, worker2.mcbo.mood.com.ve): java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:2774)

I think this problem occurs because this code launches many jobs (one for each column of type string). Could you show me an alternative, or point out what I am doing wrong?

    How many columns do you have? – zero323 Jan 28 '16 at 16:22
  • 1
    @eliasah around 136, I think that they aren't too many – Alberto Bonsanto Jan 28 '16 at 16:23
  • 1
    I think the loop is keeping the dataframe in memory each time you are computing on it and the GC doesn't have time to clean it thus no memory => SO – eliasah Jan 28 '16 at 16:23
  • 1
    @eliasah It's very probable, but I don't have any other user friendly alternative (the other one will be to do this manually column by column) – Alberto Bonsanto Jan 28 '16 at 16:25
  • 1
    Could you try to use a single select instead? This SO smells like some kind of issue with growing lineage. Also I wouldn't use UDF here. It is kind of wasteful and can be handled directly on internal representation. – zero323 Jan 28 '16 at 16:30
  • @zero323 that was exactly my point ! – eliasah Jan 28 '16 at 16:31
  • @zero323 Excuse me, but I didn't understand exactly what you mean saying "try to use a single select", I tried one select on the end (that's what shows the error) – Alberto Bonsanto Jan 28 '16 at 16:31

1 Answer


Try something like this:

from pyspark.sql.functions import col, lower, trim

# Build one expression per column: lower-case and trim the string columns,
# pass everything else through unchanged
exprs = [
    lower(trim(col(c))).alias(c) if t == "string" else col(c)
    for (c, t) in df.dtypes
]

df.select(*exprs)

This approach has two main advantages over your current solution:

  • It requires only a single projection (no growing lineage, which is most likely responsible for the StackOverflowError) instead of one projection per string column.
  • It operates directly on the internal representation without passing data to Python (BatchPythonProcessing).
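As a quick sanity check (a sketch reusing the `Tipo_unidad` column from the question), assign the result back and inspect the distinct values:

df = df.select(*exprs)
df.select("Tipo_unidad").distinct().show()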
  • 2
    Worked perfectly, but how would I do if I have to apply a really complex function, in every string column – Alberto Bonsanto Jan 28 '16 at 16:52
  • 4
    Well, pretty much the same way :) If you cannot use expression (in 1.6 Spark it shouldn't be a problem - there is enough to choose so you can create arbitrary complex transformation) just replace `lower ∘ trim` with an UDF. – zero323 Jan 28 '16 at 16:58