Trimming white space in only string columns of a DataFrame

Question

I'm trying to trim leading and trailing whitespace in any given DataFrame, but only in string columns (so as not to alter the DataFrame's schema). Another solution would be to trim all columns and then infer or replace the schema after trimming, but I'm not sure how to do that either. This is what I'm doing now:

from pyspark.sql import functions as func
from pyspark.sql.functions import col

mmDF.printSchema()
columnList = [item[0] for item in mmDF.dtypes if item[1].startswith('string')]

mmDF = mmDF.withColumn(col, func.ltrim(func.rtrim(mmDF[col] for mmDF_col in columnList)))

mmDF.show()

mmDF.printSchema()

The trimming line (the withColumn call) causes this error:

TypeError: Invalid argument, not a string or column: <generator object <genexpr> at 0x0000027D5C63E248> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

If you want to loop over your columns to apply a function, you should use select, or use reduce with withColumn – murtihash, Jul 29 '20 at 16:42
Something like mmDF = mmDF.select(func.ltrim(func.rtrim(mmDF[col])) for mmDF_col in columnList)? – splatch, Jul 29 '20 at 16:58
`mmDF.select(*(func.ltrim(func.rtrim(func.col(x))).alias(x) for x in columnList))` – murtihash, Jul 29 '20 at 17:08
Ohh, thanks so much. Sorry, one last issue... do you know how to "update" the original DF string columns with the trimmed strings? I know DFs are immutable and can't be changed... – splatch, Jul 29 '20 at 17:24

score 0 · Answer 1 · answered Jul 29 '20 at 16:48

The answer is found here. Essentially, you select the string columns with the select_dtypes method in pandas and then apply str.strip() to each of the selected columns.

bdempe

Unfortunately that solution is only for a pandas DF, not pyspark – splatch Jul 29 '20 at 16:52
My apologies. I saw DataFrame and got tunnel vision on pandas. – bdempe Jul 29 '20 at 17:19
