
As a simplified example, suppose I had a DataFrame as follows:

Group   Type   Value1   Value2
Red     A      13       24
Red     B      3        12
Blue    C      5        0
Red     C      8        9
Green   A      2        -1
Red     None   56       78
Blue    A      40       104
Green   B      1        -5

For each Group, I want to calculate the difference in Value1 between the rows of Type A and Type B, and likewise the difference in Value2.

Since Red and Green are the only Groups having entries of Type A and B, we would only calculate new rows for these Groups. So the resulting DataFrame would be:

Group   Type   Value1   Value2
Red     A-B    10       12
Green   A-B    1        4

My initial idea was simply to filter for rows where Type is either 'A' or 'B' with df = df[df['Type'].isin(['A', 'B'])], then filter again for Groups that are in rows with both 'A' and 'B' as Type (not sure how to do this), then sort and apply diff().
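That plan could be sketched as follows (an illustration, not a tested solution from the answers below; it uses `groupby.transform('nunique')` for the both-A-and-B filter and `pivot` in place of sort-and-`diff()`):

```python
import pandas as pd

df = pd.DataFrame({
    'Group':  ['Red', 'Red', 'Blue', 'Red', 'Green', 'Red', 'Blue', 'Green'],
    'Type':   ['A', 'B', 'C', 'C', 'A', None, 'A', 'B'],
    'Value1': [13, 3, 5, 8, 2, 56, 40, 1],
    'Value2': [24, 12, 0, 9, -1, 78, 104, -5],
})

# Step 1: keep only rows of Type A or B
ab = df[df['Type'].isin(['A', 'B'])]

# Step 2: keep only Groups that have both an A row and a B row
ab = ab[ab.groupby('Group')['Type'].transform('nunique') == 2]

# Step 3: pivot so A and B become columns, then subtract
wide = ab.pivot(index='Group', columns='Type')
result = (wide.xs('A', axis=1, level='Type')
          - wide.xs('B', axis=1, level='Type')).reset_index()
result['Type'] = 'A-B'
```

The `transform('nunique') == 2` mask is the step the question was unsure about: it counts distinct Types per Group after the A/B filter, so the lone Blue/A row is dropped.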

ahh_real_numbers
  • did you try that? – rgalbo Aug 26 '19 at 17:13
  • What I've been trying to figure out is how to filter out Groups where Types 'A' and 'B' don't both appear. For instance, I can filter out the three rows where Type is neither 'A' nor 'B'. But I would also like to filter out the row where Group='Blue' and Type='A', since there is no corresponding row with Type='B', and that is where I'm stuck. – ahh_real_numbers Aug 26 '19 at 17:22
  • 1
    i would just use `spread()` to create categorical columns of type so that each observation would have a separate column much like the tidyr function `spread()` https://stackoverflow.com/questions/45220904/python-pandas-pivot-how-to-do-a-proper-tidyr-like-spread – rgalbo Aug 26 '19 at 21:58
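For reference, pandas' closest analogue to tidyr's `spread()` is `pivot` (or `unstack`). A minimal sketch of the comment's idea, on a frame trimmed to the A/B rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['Red', 'Red', 'Green', 'Green'],
    'Type': ['A', 'B', 'A', 'B'],
    'Value1': [13, 3, 2, 1],
})

# pivot spreads Type across columns, one row per Group,
# so each observation pair sits on a single row
wide = df.pivot(index='Group', columns='Type', values='Value1')
diff = wide['A'] - wide['B']
```

Once A and B are separate columns, the difference is a plain column subtraction instead of a row-wise `diff()`.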

2 Answers


The following code makes a group for each Type, then subtracts each group's dataframe from every other group's, producing a final dataframe of the differences. Pass your dataframe in as inp_df; the dataframe you want will be final_df:

import pandas as pd

grouped = inp_df.groupby('Type')

# Getting the list of groups:
list_o_groups = list(grouped.groups.keys())

# Going through each group and subtracting the one from the other:
sub_df_dict = {}
for first_idx, first_df in enumerate(list_o_groups):
    for second_idx, second_df in enumerate(list_o_groups):
        if second_idx <= first_idx:
            continue
        key = '%s-%s' % (first_df, second_df)
        sub_df_dict[key] = pd.DataFrame()
        sub_df_dict[key]['Value1'] = grouped.get_group(first_df)['Value1'] - grouped.get_group(second_df)['Value1']
        sub_df_dict[key]['Value2'] = grouped.get_group(first_df)['Value2'] - grouped.get_group(second_df)['Value2']
        sub_df_dict[key]['Type'] = key

# Combining them into one df (DataFrame.append was removed in pandas 2.0):
final_df = pd.concat(sub_df_dict.values())

# Cleaning up the dataframe
final_df.dropna(inplace=True)

[Screenshot: the result of this code on the example dataframe]

*EDIT - added the dropna to clean up the df.

Alexandre Daly
import pandas as pd
from io import StringIO

# read data using string io
data = StringIO("""Group,Type,Value1,Value2
Red,A,13,24
Red,B,3,12
Blue,C,5,0
Red,C,8,9
Green,A,2,-1
Red,None,56,78
Blue,A,40,104
Green,B,1,-5""")
df = pd.read_csv(data)

# create tidyr spread like operation
def spread(df, propcol, valcol):
    indcol = list(df.columns.drop(valcol))
    df = df.set_index(indcol).unstack(propcol).reset_index()
    df.columns = [i[1] if i[0] == valcol else i[0] for i in df.columns]
    return df
df = spread(df, 'Group', 'Type')

# create filter conditions to remove 'C'. can also do the opposite
notBlueC = df['Blue'] != 'C'
notGreenC = df['Green'] != 'C'
notRedC = df['Red'] != 'C'
clean_df = df[notBlueC & notGreenC & notRedC]
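The snippet above stops at the filtering step. One possible continuation from the spread idea to the requested A-B rows (a sketch, not part of the original answer; it uses `pivot` with `dropna` rather than the hand-written `spread`) could be:

```python
import pandas as pd
from io import StringIO

data = StringIO("""Group,Type,Value1,Value2
Red,A,13,24
Red,B,3,12
Blue,C,5,0
Red,C,8,9
Green,A,2,-1
Red,None,56,78
Blue,A,40,104
Green,B,1,-5""")
df = pd.read_csv(data)

# Keep only A/B rows, spread Type across columns, take A - B;
# Groups missing either side (here Blue) come out NaN and are dropped.
ab = df[df['Type'].isin(['A', 'B'])]
wide = ab.pivot(index='Group', columns='Type', values=['Value1', 'Value2'])
result = (wide.xs('A', axis=1, level='Type')
          - wide.xs('B', axis=1, level='Type')).dropna().reset_index()
result['Type'] = 'A-B'
```

This gives one A-B row each for Red and Green, matching the desired output in the question.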
rgalbo