I'm trying to trim leading and trailing whitespace in any given DataFrame, but only in string columns (so as not to alter the schema of the DataFrame). Another solution would be to trim all columns and then infer or replace the schema after trimming, but I'm not sure how to do that either. This is what I'm doing now:

from pyspark.sql import functions as func
from pyspark.sql.functions import col

mmDF.printSchema()
columnList = [item[0] for item in mmDF.dtypes if item[1].startswith('string')]

mmDF = mmDF.withColumn(col, func.ltrim(func.rtrim(mmDF[col] for mmDF_col in columnList)))

mmDF.show()

mmDF.printSchema()

The trimming line (the withColumn call) raises this error:

TypeError: Invalid argument, not a string or column: <generator object <genexpr> at 0x0000027D5C63E248> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
splatch
  • If you want to loop over your columns to apply a function, you should use select, or use reduce with withColumn – murtihash Jul 29 '20 at 16:42
  • Something like mmDF = mmDF.select(func.ltrim(func.rtrim(mmDF[col])) for mmDF_col in columnList) ? – splatch Jul 29 '20 at 16:58
  • `mmDF.select(*(func.ltrim(func.rtrim(func.col(x))).alias(x) for x in columnList))` – murtihash Jul 29 '20 at 17:08
  • Ohh, thanks so much. Sorry, one last issue... do you know how to "update" the original DF string columns with the trimmed strings? I know DFs are immutable and can't be changed... – splatch Jul 29 '20 at 17:24

1 Answer

The answer is found here. Essentially, you select the string columns with the select_dtypes method found in pandas and then apply str.strip() to each of the selected columns.
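
As a rough illustration of that pandas approach (it applies to a pandas DataFrame, not a Spark one), something like the following could work. The helper name strip_string_columns is hypothetical, and the sketch assumes that object-typed columns actually hold strings.

```python
import pandas as pd


def strip_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with whitespace stripped from its string columns.

    Hypothetical helper, not part of pandas. Assumes object columns contain
    strings; non-string values in such a column would become NaN.
    """
    out = df.copy()
    # Strings are stored as "object" in older pandas and may use a dedicated
    # string dtype in newer versions, so both are selected here.
    str_cols = out.select_dtypes(include=["object", "string"]).columns
    for name in str_cols:
        out[name] = out[name].str.strip()
    return out
```

Numeric and other non-string columns are left as they are, and the input DataFrame is not modified because the function works on a copy.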

bdempe