2

I try to aggregate based on one column "col1" like this and keep same value from other columns:

df_input 

       col1      col2    col3    col4  
0       ID1      DE      69      min-8     
1       ID1      DZ      69      min-8
2       ID3      DA      54      min-15
3       ID3      AC      54      min-15
4       ID3      RC      54      min-15
5       ID8      UC      2       min-40
6       ID8      TC      2       min-40
7       ID8      VC      2       min-40
8       ID8      WC      2       min-40
9       ID7      XC      4       min-7



df_output 

           col1      col2             col3    col4  
    0       ID1      DE,DZ            69      min-8     
    1       ID3      DA,AC,RC         54      min-15
    2       ID8      UC,TC,VC,WC      2       min-40
    3       ID7      XC               4       min-7
Julie Hardy
  • 127
  • 8

2 Answers2

0

Let's try breaking the below code.

  • First, group your dataframe by col1, and then perform .agg on the grouped object.
  • We will then use a lambda function on col2 to get all of it's elements in a list
  • Let's use the argument 'first', to show that we want to keep only the first element of col3 and col4
  • Then, reset the index.
agg_df = (df.groupby('col1')
      .agg({'col2': lambda x: x.tolist(),'col3':'first','col4':'first'})
      .reset_index())     

print(agg_df)

  col1              col2  col3    col4
0  ID1          [DE, DZ]    69   min-8
1  ID3      [DA, AC, RC]    54  min-15
2  ID7              [XC]     4   min-7
3  ID8  [UC, TC, VC, WC]     2  min-40

To then convert col2 from having it's values stored in a list to proper string, we can join it's elements using a ,:

agg_df['col2'].apply(lambda x: ','.join(str(i) for i in x))

Out[16]: 
0          DE,DZ
1       DA,AC,RC
2             XC
3    UC,TC,VC,WC
Name: col2, dtype: object  
sophocles
  • 13,593
  • 3
  • 14
  • 33
  • 1
    Thank your for your fast reply and the details of the explanation !! – Julie Hardy Jan 09 '21 at 10:28
  • Glad I could help. – sophocles Jan 09 '21 at 10:32
  • I got another question do you know if there is the method when you don't specify the colnames ? I try to join when the value is different and aggregate when the value is identical without specify the colnames ? I don't know if it is possible ? – Julie Hardy Jan 09 '21 at 10:44
  • I am not quite sure I understand what you mean. Can you please elaborate a bit more? – sophocles Jan 09 '21 at 11:04
  • 1
    Yes sorry, it is not really clear on my head but I was wondering if we can only specifiy and aggregated value from col1 and the others columns without specify colnames, because I got a big dataframe; But it is ok I just specify for each columns from the dataframe to keep the first element or join elements – Julie Hardy Jan 09 '21 at 11:43
  • please check this out: https://stackoverflow.com/questions/47360510/pandas-groupby-and-aggregation-output-should-include-all-the-original-columns-i. You might find what you need. – sophocles Jan 09 '21 at 12:00
0

Based on Sophods answer, one line of code should resolve your problem

df = df.groupby('col1').agg({'col2':lambda x: ",".join(x),'col3':np.mean,'col4':'first'}).reset_index()
Frank Z.
  • 23
  • 4