6

How can I create new rows from an existing DataFrame by grouping by certain fields (in the example "Country" and "Industry") and applying some math to another field (in the example "Field" and "Value")?

Source DataFrame

df = pd.DataFrame({'Country': ['USA','USA','USA','USA','USA','USA','Canada','Canada'],
                   'Industry': ['Finance', 'Finance', 'Retail', 
                                'Retail', 'Energy', 'Energy', 
                                'Retail', 'Retail'],
                   'Field': ['Import', 'Export','Import', 
                             'Export','Import', 'Export',
                             'Import', 'Export'],
                   'Value': [100, 50, 80, 10, 20, 5, 30, 10]})

    Country Industry    Field   Value
0   USA     Finance     Import  100
1   USA     Finance     Export  50
2   USA     Retail      Import  80
3   USA     Retail      Export  10
4   USA     Energy      Import  20
5   USA     Energy      Export  5
6   Canada  Retail      Import  30
7   Canada  Retail      Export  10

Target DataFrame

Net = Import - Export

    Country Industry    Field   Value
0   USA     Finance     Net     50
1   USA     Retail      Net     70
2   USA     Energy      Net     15
3   Canada  Retail      Net     20
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
Lorenz
  • 266
  • 1
  • 5
  • 16

5 Answers5

10

There are quite possibly many ways. Here's one using groupby and unstack:

(df.groupby(['Country', 'Industry', 'Field'], sort=False)['Value']
   .sum()
   .unstack('Field')
   .eval('Import - Export')
   .reset_index(name='Value'))

  Country Industry  Value
0     USA  Finance     50
1     USA   Retail     70
2     USA   Energy     15
3  Canada   Retail     20
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    By far the best answer. The `unstack` followed by `eval` is a really nice trick — better than a second `groupby` and `get_group` I would have done – BallpointBen Apr 13 '19 at 22:17
  • 1
    @BallpointBen `eval` and `query` are personal favourites of mine from the API. I've made attempts to popularise their use, but their usage is not completely understood. I have a QnA [here](https://stackoverflow.com/a/53779987/4909087), if you are interested. – cs95 Apr 13 '19 at 22:19
  • Works like a charm. Thank you very much. Very small comment - there is a closing bracket missing in the last line. – Lorenz Apr 14 '19 at 01:40
  • @coldspeed Actually I think there’s a better way… see my answer. `unstack` is expensive because it reshapes. Using the structure of the first groupby is more efficient, although it takes two lines. – BallpointBen Apr 14 '19 at 03:03
  • @BallpointBen very crisp solution, nice! Although, I still like mine because it is a (still readable) one liner. – cs95 Apr 14 '19 at 03:22
4

IIUC

df=df.set_index(['Country','Industry'])

Newdf=(df.loc[df.Field=='Export','Value']-df.loc[df.Field=='Import','Value']).reset_index().assign(Field='Net')
Newdf
  Country Industry  Value Field
0     USA  Finance    -50   Net
1     USA   Retail    -70   Net
2     USA   Energy    -15   Net
3  Canada   Retail    -20   Net

pivot_table

df.pivot_table(index=['Country','Industry'],columns='Field',values='Value',aggfunc='sum').\
  diff(axis=1).\
     dropna(1).\
        rename(columns={'Import':'Value'}).\
          reset_index()
Out[112]: 
Field Country Industry  Value
0      Canada   Retail   20.0
1         USA   Energy   15.0
2         USA  Finance   50.0
3         USA   Retail   70.0
BENY
  • 317,841
  • 20
  • 164
  • 234
3

You can do it this way to add those rows to your original dataframe:

df.set_index(['Country','Industry','Field'])\
  .unstack()['Value']\
  .eval('Net = Import - Export')\
  .stack().rename('Value').reset_index()

Output:

   Country Industry   Field  Value
0   Canada   Retail  Export     10
1   Canada   Retail  Import     30
2   Canada   Retail     Net     20
3      USA   Energy  Export      5
4      USA   Energy  Import     20
5      USA   Energy     Net     15
6      USA  Finance  Export     50
7      USA  Finance  Import    100
8      USA  Finance     Net     50
9      USA   Retail  Export     10
10     USA   Retail  Import     80
11     USA   Retail     Net     70
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thanks - actually, I wanted to append it to the original df. So, nice trick to do this all in one command, – Lorenz Apr 14 '19 at 01:41
  • 1
    Coldspeed‘s answer was a slight better fit to my overall code. Took from your code how you appended the result to the original df. Very tight result, though. Pitty that i can not accept two answers. But thanks again! – Lorenz Apr 14 '19 at 03:07
2

You can use Groupby.diff() and after that recreate the Field column and finally use DataFrame.dropna:

df['Value'] = df.groupby(['Country', 'Industry'])['Value'].diff().abs()
df['Field'] = 'Net'
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

print(df)
  Country Industry Field  Value
0     USA  Finance   Net   50.0
1     USA   Retail   Net   70.0
2     USA   Energy   Net   15.0
3  Canada   Retail   Net   20.0
Erfan
  • 40,971
  • 8
  • 66
  • 78
2

This answer takes advantage of the fact that pandas puts the group keys in the multiindex of the resulting dataframe. (If there were only one group key, you could use loc.)

>>> s = df.groupby(['Country', 'Industry', 'Field'])['Value'].sum()
>>> s.xs('Import', axis=0, level='Field') - s.xs('Export', axis=0, level='Field')
Country  Industry
Canada   Retail      20
USA      Energy      15
         Finance     50
         Retail      70
Name: Value, dtype: int64
BallpointBen
  • 9,406
  • 1
  • 32
  • 62