-2

I have a dataframe with following columns

Name   Company generic name generic name R&D Number (DC-) R&D Number (A)   type 
A      AB      53654        5767         1111             3333             a
C      CD      56767                     56667       

I would like to create a subset of this dataframe and combine columns with a pattern by aggregating the values by a ','

The desired output is:

Name   Company generic name        R&D Number   
A      AB      53654, 5767         1111, 3333             
C      CD      56767               56667       

I found there is a way to filter the columns based on regex like below df.filter(regex=("R&D Number.*"))

But is there a way to join with the other columns and concatenate the value by ',' to get the final output. Any help is highly appreciated

rshar
  • 1,381
  • 10
  • 28
  • 2
    can you [provide your input as DataFrame constructor](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for clarity? – mozway Jun 23 '23 at 10:19

1 Answers1

1

You can normalize the names by removing the trailing (xx) part (adapt the exact logic if needed), then groupby on the columns to aggregate the values:

out = (df.groupby(df.columns.str.replace(r'\s*\(.+\)$', '', regex=True), axis=1)
         .agg(lambda g: g.replace('', None).stack().astype(str)
                         .groupby(level=0).agg(','.join))
      )

Or using a double transpose trick:

out = (df.T
       .groupby(df.columns.str.replace(r'\s*\(.+\)$', '', regex=True))
       .agg(lambda g: ','.join(g.replace('', None).dropna().astype(str)))
       .T
      )

Output:

  Company Name R&D Number generic name type
0      AB    A  1111,3333   53654,5767    a
1      CD    C      56667        56767  NaN

Reproducible input:

df = pd.DataFrame.from_dict({'index': [0, 1],
                             'columns': ['Name', 'Company', 'generic name', 'generic name', 'R&D Number (DC-)', 'R&D Number (A)', 'type'],
                             'data': [['A', 'AB', '53654', '5767', '1111', '3333', 'a'], ['C', 'CD', '56767', '', '56667', '', '']],
                             'index_names': [None],
                             'column_names': [None]},
                            orient='tight')
mozway
  • 194,879
  • 13
  • 39
  • 75