5

Summary:

I have many columns with data (data_cols) and with text (text_cols), I'd like to do the following by referring to a list of column names, but cannot figure it out: df.groupby('id', as_index=False).agg({data_cols: 'sum', text_cols: 'first'})

Explanation:

I have a dataframe with ~30 columns, some of the columns contain values and the others contain text. I would like to use sum all values with the same id and for the text to use the first entry. I can achieve this by using groupby:

d = {'id': ['a', 'a', 'b', 'c'], 'value1': [1, 2, 3, 4], 'value2': [5, 6, 7, 8], 'text1': ['w', 'x', 'y', 'z']}
df = pd.DataFrame(d)

  id  value1  value2 text1
0  a       1       5     w
1  a       2       6     x
2  b       3       7     y
3  c       4       8     z

df.groupby('id', as_index=False).agg({'value1': 'sum', 'value2': 'sum', 'text1': 'first'})

  id  value1  value2 text1
0  a       3      11     w
1  b       3       7     y
2  c       4       8     z

This is exactly what I would like to achieve, except that I have many columns and would prefer not to write out all column names. So I have tried different things to reference multiple columns, but I don't get any to work.

data_cols = df.columns[1:3]
text_cols = set(df.columns) - set(data_cols)

df.groupby('id', as_index=False).agg({data_cols: 'sum', text_cols: 'first'})

Here, I get TypeError: unhashable type: 'Index', so I thought I could avoid this by using tuple:

data_cols = tuple(df.columns[1:3])
text_cols = tuple(set(df.columns) - set(data_cols)) #I have many data columns and some text columbs before and after them

Which gives me: SpecificationError: Column(s) [('text1', 'id'), ('value1', 'value2')] do not exist I think it views ('value1', 'value2') as a single column, instead of unpacking it into the two columns. Is there any way or format to avoid this?

Alternatively I can split the df into 2 dfs: df_values and df_text, do the grouping on df_values and then join them back together, but this seems cumbersome and I imagine there to be a better way.

Dianne
  • 73
  • 1
  • 8

3 Answers3

5

Create dictionaries by dict.fromkeys and merge them, last pass to agg:

data_cols = df.columns[1:3]
text_cols = set(df.columns) - set(data_cols)
d1 = dict.fromkeys(data_cols, 'sum')
d2 = dict.fromkeys(text_cols, 'first')

#https://stackoverflow.com/questions/38987
d = {**d1, **d2}

df.groupby('id', as_index=False).agg(d)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
4

You need to first convert the column names to a dict with desired operations and then merge the dict.

(
    df.groupby('id', as_index=False)
    .agg({**{e:'sum' for e in data_cols}, **{e:'first' for e in text_cols}})
)
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
  • Thank you, Allen! this works great too, perfect for a one-liner. I decided to mark the other answer as accepted because I understand what is happening slightly better. I can of course change that if someone has a reason for why this answer is better. – Dianne May 27 '21 at 07:41
0

If you have the columns as a list. I would highly recommend using the following method -

df.groupby('col', as_index=False)[columns_list].agg('sum')

P.S - The as_index=False is an important parameter

Worked like a charm for me!

Mohseen Mulla
  • 542
  • 7
  • 15
  • 1
    Thank you for your suggestion! This does not seem to allow me to use 'sum' for part of the columns and 'first' for other columns. The other answers leave all flexibility and I have also used them to use more than two aggregation methods, which I also do not think is possible here. – Dianne Apr 19 '22 at 09:38