
I have an input Spark DataFrame named df:

+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
|         725153|             3.0|                    2.0|
|         873008|             4.0|                    1.0|
|         625109|             1.0|                    0.0|
+---------------+----------------+-----------------------+

I need to remove the special characters from the column names of df as follows:

  • Remove +

  • Replace spaces with underscores

  • Replace dots with underscores

So df should become:

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|         725153|            3.0|                    2.0|
|         873008|            4.0|                    1.0|
|         625109|            1.0|                    0.0|
+---------------+---------------+-----------------------+

Using Scala, I have achieved this with:

var tableWithColumnsRenamed = df

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

df = tableWithColumnsRenamed

When I used,

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

I got the first column name as 126 Concentrate instead of 126_Concentrate. (The chained calls all look up the original name held in field, so once the second withColumnRenamed removes the + the third call can no longer find a column named 126+ Concentrate and silently does nothing.)

But I would prefer not to use three for loops for this replacement. Is there a better solution?

Ramesh Maharjan
PRIYA M

5 Answers

df
  .columns
  .foldLeft(df){(newdf, colname) =>
    newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
  }
  .show
Momer
Chandan Ray
  • Yes, it works perfectly. According to my use case, I have changed the solution to: `df.columns.foldLeft(df){(newdf, colname) => newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_").replace("+", ""))}` – PRIYA M Jun 29 '18 at 10:03
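Since the rename logic here operates on plain strings, the chained replace calls can be sanity-checked on the example column names without any Spark session (a minimal check, independent of the DataFrame itself):

```scala
// Pure-string check of the chained replace calls on the example column names;
// no Spark is needed because column names are ordinary strings.
val cols = Seq("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
val renamed = cols.map(_.replace(" ", "_").replace(".", "_").replace("+", ""))
renamed.foreach(println)
// Main_CustomerID
// 126_Concentrate
// 2_5_Ethylhexyl_Acrylate
```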

You can use withColumnRenamed with a regex replaceAllIn and foldLeft as below:

val columns = df.columns

val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))

val resultDF = replacingColumns.zip(columns).foldLeft(df){(tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)}

resultDF.show(false)

which should give you

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153         |3.0            |2.0                    |
|873008         |4.0            |1.0                    |
|625109         |1.0            |0.0                    |
+---------------+---------------+-----------------------+

I hope the answer is helpful
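As a side note, the same one-pass renaming can also be applied without foldLeft, since DataFrame.toDF(colNames: String*) accepts the full list of new names at once. A minimal sketch computing the cleaned names on plain strings (with a Spark session in scope, the final step would be df.toDF(newNames: _*)):

```scala
// Compute all cleaned column names in one regex pass; with Spark available
// they could then be applied in a single call: df.toDF(newNames: _*)
val regex = """[+._, ]+""".r
val columns = Seq("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
val newNames = columns.map(c => regex.replaceAllIn(c, "_"))
newNames.foreach(println)
// Main_CustomerID
// 126_Concentrate
// 2_5_Ethylhexyl_Acrylate
```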

Ramesh Maharjan

In Java, you can iterate over the column names using df.columns() and replace each header string with replaceAll(regexPattern, intendedReplacement).

Then use withColumnRenamed(headerName, correctedHeaderName) to rename the DataFrame headers.

e.g.:

for (String headerName : dataset.columns()) {
    // "+" and "." are regex metacharacters, so they must be escaped;
    // an unescaped "+" makes replaceAll throw a PatternSyntaxException.
    String correctedHeaderName = headerName
            .replaceAll(" ", "_")
            .replaceAll("\\+", "")
            .replaceAll("\\.", "_");
    dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}
dataset.show();
Bùi Đức Khánh
NiharGht

Piggybacking on Ramesh's answer, here is a reusable function that uses currying syntax with the .transform() method and also lower-cases the column names:

// Format all column names with regex with lower_case names
def formatAllColumns(regex_string:String)(df: DataFrame): DataFrame = {
  val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
  val resultDF:DataFrame = replacingColumns.zip(df.columns).foldLeft(df){
    (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1.toLowerCase())
  }
  resultDF
}
val resultDF = df.transform(formatAllColumns(regex_string="""[+._(), ]+"""))
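The name-cleaning part of formatAllColumns can again be checked on plain strings, without a Spark session (the column names below are just the ones from the question):

```scala
// Pure-string check of the regex replacement plus lower-casing used above.
val regex = """[+._(), ]+""".r
val cols = Seq("126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
val cleaned = cols.map(c => regex.replaceAllIn(c, "_").toLowerCase)
cleaned.foreach(println)
// 126_concentrate
// 2_5_ethylhexyl_acrylate
```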
kevin_theinfinityfund

We can rename all the columns in a single select by mapping each column name to a new alias, using replaceAll for each special character. Note that the original names must be wrapped in backticks so Spark does not interpret the dots as struct-field access. This one-liner is tried and tested with Spark Scala:

df.select(
  df.columns.map(colName =>
    col(s"`${colName}`") // backticks: the raw names contain dots and spaces
      .as(colName.replaceAll("\\.", "_").replaceAll(" ", "_").replaceAll("\\+", ""))
  ): _*
).show(false)
akgarg511
  • While this code may solve the problem the answer would be a lot better with an explanation on how/why it does. Remember that your answer is not just for the user that asked the question but also for all the other people that find it. – Sabito stands with Ukraine Dec 28 '20 at 16:40