
I have a DataFrame with many columns of str type, and I want to apply a function to all of those columns without renaming them or adding new columns. I tried using a for-in loop that calls withColumn (see the example below), but when I run the code it usually throws a StackOverflowError (it rarely works). The DataFrame is not big at all; it has only ~15,000 records.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# df is a DataFrame
def lowerCase(string):
    return string.strip().lower()

lowerCaseUDF = udf(lowerCase, StringType())

# Replace every string column with a trimmed, lower-cased version of itself
for (columnName, kind) in df.dtypes:
    if kind == "string":
        df = df.withColumn(columnName, lowerCaseUDF(df[columnName]))

df.select("Tipo_unidad").distinct().show()

The complete error is very long, so I will paste only a few lines here. You can find the full trace here: Complete Trace

Py4JJavaError: An error occurred while calling o516.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, worker2.mcbo.mood.com.ve): java.lang.StackOverflowError at java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:2774)

I think this problem occurs because this code launches many jobs (one for each column of type string). Could you show me an alternative, or point out what I am doing wrong?

    How many columns do you have? – zero323 Jan 28 '16 at 16:22
  • 1
    @eliasah around 136, I think that they aren't too many – Alberto Bonsanto Jan 28 '16 at 16:23
  • 1
    I think the loop is keeping the dataframe in memory each time you are computing on it and the GC doesn't have time to clean it thus no memory => SO – eliasah Jan 28 '16 at 16:23
  • 1
    @eliasah It's very probable, but I don't have any other user friendly alternative (the other one will be to do this manually column by column) – Alberto Bonsanto Jan 28 '16 at 16:25
  • 1
    Could you try to use a single select instead? This SO smells like some kind of issue with growing lineage. Also I wouldn't use UDF here. It is kind of wasteful and can be handled directly on internal representation. – zero323 Jan 28 '16 at 16:30
  • @zero323 that was exactly my point ! – eliasah Jan 28 '16 at 16:31
  • @zero323 Excuse me, but I didn't understand exactly what you mean saying "try to use a single select", I tried one select on the end (that's what shows the error) – Alberto Bonsanto Jan 28 '16 at 16:31

1 Answer


Try something like this:

from pyspark.sql.functions import col, lower, trim

# Build one expression per column: lower-case and trim the string columns,
# pass everything else through unchanged
exprs = [
    lower(trim(col(c))).alias(c) if t == "string" else col(c)
    for (c, t) in df.dtypes
]

df.select(*exprs)

This approach has two main advantages over your current solution:

  • It requires only a single projection (no growing lineage, which is most likely responsible for the StackOverflowError) instead of one projection per string column.
  • It operates directly on the internal representation without passing data to Python (BatchPythonProcessing).
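As a quick sanity check (a sketch reusing the `Tipo_unidad` column from the question), assign the result back and inspect the distinct values:

df = df.select(*exprs)
df.select("Tipo_unidad").distinct().show()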
  • 2
    Worked perfectly, but how would I do if I have to apply a really complex function, in every string column – Alberto Bonsanto Jan 28 '16 at 16:52
  • 4
    Well, pretty much the same way :) If you cannot use expression (in 1.6 Spark it shouldn't be a problem - there is enough to choose so you can create arbitrary complex transformation) just replace `lower ∘ trim` with an UDF. – zero323 Jan 28 '16 at 16:58