1

I am trying to create a generic function to aggregate a Pandas dataframe to a single row per part number.

The dataframe is being created from reading in a csv which is huge, it is 115,371 x 174. For a minimal, complete, verifiable example, I have reduced the data to the table below:

| Part | Plant | Make_Buy |
|:----:|:-----:|:--------:|
|   A  |       |    Buy   |
|   A  |  XXX  |   Make   |
|   A  |  XXX  |    Buy   |
|   A  |  XXY  |    Buy   |
|   A  |  XXY  |   Make   |
|   B  |   YY  |   Make   |
|   B  |  XXY  |   Make   |

Here also as Python code:

import pandas as pd

d = {'part': ['A', 'A', 'A', 'A', 'A', 'B', 'B'], 
     'plant': [None, 'XXX', 'XXX', 'XXY', 'XXY', 'YY', 'XXY'],
     'make_buy': ['Buy', 'Make', 'Buy', 'Buy', 'Make', 'Make', 'Make']}
df = pd.DataFrame(data=d)

The ideal function would be able to do the following:

For any column that contains multiple values for a single part, aggregate them.

The ideal dataframe would look like:

| Part |   Plant  |  Make_Buy |
|:----:|:--------:|:---------:|
|   A  | XXX, XXY | Buy, Make |
|   B  |  YY, XXY |    Make   |

The issue is, there are 175 columns. So I need something that is generic enough to scan through each column, if there are multiple lines, apply the aggregation. I know in Oracle SQL I can do:

SELECT
Part,
LISTAGG(col1, ', ') AS col1,
LISTAGG(col2, ', ') AS col2,
etc

FROM
table

I have looked into aggregation via text columns via the following post, but I don't know how to generically change:

df['part'] = df[['part','plant','make_buy']].groupby(['part'])['part'].transform(lambda x: ','.join(x))
df[['part','plant','make_buy']].drop_duplicates()

To every column

jkortner
  • 549
  • 2
  • 8
  • 23
artemis
  • 6,857
  • 11
  • 46
  • 99

1 Answers1

3

You can use sets for this:

df.groupby('Part').agg(set)

          Plant     Make_Buy
Part                         
A     {XXY, XXX}  {Buy, Make}
B      {YY, XXY}       {Make}

To have the result as strings:

df.groupby('Part').agg(lambda x: ', '.join(set(x)))

       Plant   Make_Buy
Part                     
A     XXY, XXX  Buy, Make
B      YY, XXY       Make
yatu
  • 86,083
  • 12
  • 84
  • 139
  • `df.groupby('Part').agg(lambda x: ', '.join(set(x)))` only returns the `make_buy` variable. – jkortner Mar 12 '20 at 15:10
  • 1
    @wundermahn: If you want to have `df.groupby('part').agg(set)`as lists you can do `df.groupby('part').agg(set).apply(lambda x: x.apply(list))`. – jkortner Mar 12 '20 at 15:35
  • I'd prefer for it to not be a data structure in the column, as I'm going to end up dummy encoding it using `sklearn`. Is there any way to keep it as a string, but comma separated? @ybml – artemis Mar 12 '20 at 15:42
  • @wundermahn that's what the second part of the answer does? – anky Mar 12 '20 at 15:53