
I do an aggregation sum on all columns in a dataframe with code similar to this:

   df_aggregated = df.groupBy('key').sum()

The result columns are renamed following this pattern:

   sum(column1), sum(column2)

I need to rename all the new columns back to their previous names.

I tried to use `alias`, but I have too many columns and I would like something more automatic.

1 Answer


You can use this function. Every aggregation produces names with the same pattern: `xxx(column_name)`. Since there is a pattern, you can use a regular expression to match it.

import re

def rename_columns_after_any_aggregation(df):
    # Rename e.g. 'sum(column1)' back to 'column1' for every column
    # whose name matches the 'aggregation(column)' pattern.
    for name in df.schema.names:
        m = re.search(r'\((.*?)\)', name)  # raw string so the backslashes are literal
        if m:
            df = df.withColumnRenamed(name, m.group(1))

    return df
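As a quick sanity check, the regex logic can be exercised on plain strings without a Spark session. This is a standalone sketch; the column names below are hypothetical examples of what `df.groupBy('key').sum()` produces:

```python
import re

def strip_aggregation(name):
    # Extract the original column name from 'agg(name)';
    # names without parentheses (e.g. the grouping key) are left untouched.
    m = re.search(r'\((.*?)\)', name)
    return m.group(1) if m else name

# Hypothetical aggregated column names
names = ['key', 'sum(column1)', 'sum(column2)', 'avg(column3)']
print([strip_aggregation(n) for n in names])
# ['key', 'column1', 'column2', 'column3']
```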
mjimcua
    Thank you so much! I've been trying to do this for a long time but I didn't even think of this approach :) – Pablo Suaña Cuesta Aug 29 '18 at 18:03
  • For those coming here looking for a scala solution: `df.select(df.columns.map(c => col(c).as(raw"^.+\((.*?)\)".r.replaceAllIn(c, m => m.group(1)))): _*)`. This will strip out the col name to only what is inside the parentheses. Now if only Stack Overflow would stop closing questions and pointing to other threads that don't actually answer OPs question... – Brendan Jul 24 '20 at 19:02
  • See relevant thread here for more details: https://stackoverflow.com/questions/53002360/spark-expression-rename-the-column-list-after-aggregation/63080089#63080089 – Brendan Jul 24 '20 at 19:12
  • 1
    This may break when you will switch from Spark 2 to Spark 3 as aggregate column name generation logic has changed for some of them – proggeo Oct 08 '21 at 09:19