
I have an input Spark DataFrame named df:

+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
|         725153|             3.0|                    2.0|
|         873008|             4.0|                    1.0|
|         625109|             1.0|                    0.0|
+---------------+----------------+-----------------------+

I need to remove the special characters from the column names of df as follows:

  • Remove +

  • Replace spaces with underscores

  • Replace dots with underscores

So df should become:

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|         725153|            3.0|                    2.0|
|         873008|            4.0|                    1.0|
|         625109|            1.0|                    0.0|
+---------------+---------------+-----------------------+

Using Scala, I have achieved this with:

var tableWithColumnsRenamed = df

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

df = tableWithColumnsRenamed

When I used,

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

I got the first column name as 126 Concentrate instead of 126_Concentrate. (The chained calls all look up the original name held in field, so once the second withColumnRenamed removes the + the third call can no longer find a column named 126+ Concentrate and silently does nothing.)

But I would prefer not to use three for loops for this replacement. Is there a better solution?

Ramesh Maharjan
PRIYA M

5 Answers

df
  .columns
  .foldLeft(df){(newdf, colname) =>
    newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
  }
  .show
Momer
Chandan Ray
  • Yes, it works perfectly. According to my use case, I have changed the solution to: `df.columns.foldLeft(df){(newdf, colname) => newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_").replace("+", ""))}` – PRIYA M Jun 29 '18 at 10:03
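Since the rename logic here operates on plain strings, the chained replace calls can be sanity-checked on the example column names without any Spark session (a minimal check, independent of the DataFrame itself):

```scala
// Pure-string check of the chained replace calls on the example column names;
// no Spark is needed because column names are ordinary strings.
val cols = Seq("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
val renamed = cols.map(_.replace(" ", "_").replace(".", "_").replace("+", ""))
renamed.foreach(println)
// Main_CustomerID
// 126_Concentrate
// 2_5_Ethylhexyl_Acrylate
```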

You can use withColumnRenamed with a regex replaceAllIn and foldLeft as below:

val columns = df.columns

val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))

val resultDF = replacingColumns.zip(columns).foldLeft(df){(tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)}

resultDF.show(false)

which should give you

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153         |3.0            |2.0                    |
|873008         |4.0            |1.0                    |
|625109         |1.0            |0.0                    |
+---------------+---------------+-----------------------+

I hope the answer is helpful
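As a side note, the same one-pass renaming can also be applied without foldLeft, since DataFrame.toDF(colNames: String*) accepts the full list of new names at once. A minimal sketch computing the cleaned names on plain strings (with a Spark session in scope, the final step would be df.toDF(newNames: _*)):

```scala
// Compute all cleaned column names in one regex pass; with Spark available
// they could then be applied in a single call: df.toDF(newNames: _*)
val regex = """[+._, ]+""".r
val columns = Seq("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
val newNames = columns.map(c => regex.replaceAllIn(c, "_"))
newNames.foreach(println)
// Main_CustomerID
// 126_Concentrate
// 2_5_Ethylhexyl_Acrylate
```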

Ramesh Maharjan

In Java, you can iterate over the column names using df.columns() and replace each header string with replaceAll(regexPattern, intendedReplacement).

Then use withColumnRenamed(headerName, correctedHeaderName) to rename the DataFrame headers.

e.g.:

for (String headerName : dataset.columns()) {
    // "+" and "." are regex metacharacters, so they must be escaped;
    // an unescaped "+" makes replaceAll throw a PatternSyntaxException.
    String correctedHeaderName = headerName
            .replaceAll(" ", "_")
            .replaceAll("\\+", "")
            .replaceAll("\\.", "_");
    dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}
dataset.show();
Bùi Đức Khánh
NiharGht

Piggybacking on Ramesh's answer, here is a reusable function that uses currying syntax with the .transform() method and also lower-cases the column names:

// Format all column names with regex with lower_case names
def formatAllColumns(regex_string:String)(df: DataFrame): DataFrame = {
  val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
  val resultDF:DataFrame = replacingColumns.zip(df.columns).foldLeft(df){
    (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1.toLowerCase())
  }
  resultDF
}
val resultDF = df.transform(formatAllColumns(regex_string="""[+._(), ]+"""))
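The name-cleaning part of formatAllColumns can again be checked on plain strings, without a Spark session (the column names below are just the ones from the question):

```scala
// Pure-string check of the regex replacement plus lower-casing used above.
val regex = """[+._(), ]+""".r
val cols = Seq("126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
val cleaned = cols.map(c => regex.replaceAllIn(c, "_").toLowerCase)
cleaned.foreach(println)
// 126_concentrate
// 2_5_ethylhexyl_acrylate
```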
kevin_theinfinityfund

We can rename all the columns in a single select by mapping each column name to a new alias, using replaceAll for each special character. Note that the original names must be wrapped in backticks so Spark does not interpret the dots as struct-field access. This one-liner is tried and tested with Spark Scala:

df.select(
  df.columns.map(colName =>
    col(s"`${colName}`") // backticks: the raw names contain dots and spaces
      .as(colName.replaceAll("\\.", "_").replaceAll(" ", "_").replaceAll("\\+", ""))
  ): _*
).show(false)
akgarg511
  • While this code may solve the problem the answer would be a lot better with an explanation on how/why it does. Remember that your answer is not just for the user that asked the question but also for all the other people that find it. – Sabito stands with Ukraine Dec 28 '20 at 16:40