I have a DataFrame whose columns contain carriage returns, line feeds and tabs. I found a post with a solution for pandas:

replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=<INPLACE>)

How can I do this in a Spark DataFrame?

Bill S

1 Answer

To replace carriage returns, line feeds and tabs, you can use \s, which matches any whitespace character (note that this includes ordinary spaces as well):

\s = [ \t\n\r\f\v]

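Here is the PySpark code to do the replacement in every column of the DataFrame (the replacement is "," so the substitutions are visible in the output; use "" to remove the characters instead):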

from pyspark.sql import functions as F

df = spark.createDataFrame([("\ttext1", 'text2\n', 'te\rxt3'), ("text1\t", '\ntext2', 't\rext3')], ['col1', 'col2', 'col3'])

expr = [F.regexp_replace(F.col(column), pattern=r"\s+", replacement=",").alias(column) for column in df.columns]

df.select(expr).show()

+------+------+------+
|  col1|  col2|  col3|
+------+------+------+
|,text1|text2,|te,xt3|
|text1,|,text2|t,ext3|
+------+------+------+
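
If ordinary spaces inside the text must be kept, a narrower character class may be a better fit than \s. The sketch below uses Python's own re module (not Spark) just to show how the two patterns differ; the helper name strip_control_ws and the sample strings are made up for illustration.

```python
import re

# Only tabs, line feeds and carriage returns; ordinary spaces are left alone.
CONTROL_WS = r"[\t\n\r]+"

def strip_control_ws(text, replacement=""):
    """Remove (or replace) runs of tab, LF and CR characters."""
    return re.sub(CONTROL_WS, replacement, text)

samples = ["\ttext 1", "text 2\n", "te\rxt 3"]

cleaned = [strip_control_ws(s) for s in samples]
print(cleaned)  # ['text 1', 'text 2', 'text 3']

# For comparison, \s+ also swallows the spaces inside the text.
greedy = [re.sub(r"\s+", "", s) for s in samples]
print(greedy)   # ['text1', 'text2', 'text3']
```

The same pattern should work in the Spark expression above, for example F.regexp_replace(F.col(column), r"[\t\n\r]+", ""), since these three escapes mean the same thing in the Java regex engine that Spark uses.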
Henrique Florencio