
As a simplified example, suppose I had a DataFrame as follows:

Group   Type   Value1   Value2
Red     A      13       24
Red     B      3        12
Blue    C      5        0
Red     C      8        9
Green   A      2        -1
Red     None   56       78
Blue    A      40       104
Green   B      1        -5

For each Group, I want to calculate the difference in Value1 between the rows of Type A and Type B, and likewise the difference in Value2.

Since Red and Green are the only Groups having entries of Type A and B, we would only calculate new rows for these Groups. So the resulting DataFrame would be:

Group   Type   Value1   Value2
Red     A-B    10       12
Green   A-B    1        4

My initial idea was simply to filter for rows where Type is either 'A' or 'B' with df = df[df['Type'].isin(['A', 'B'])], then filter again for Groups that are in rows with both 'A' and 'B' as Type (not sure how to do this), then sort and apply diff().
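That plan could be sketched as follows (an illustration, not a tested solution from the answers below; it uses `groupby.transform('nunique')` for the both-A-and-B filter and `pivot` in place of sort-and-`diff()`):

```python
import pandas as pd

df = pd.DataFrame({
    'Group':  ['Red', 'Red', 'Blue', 'Red', 'Green', 'Red', 'Blue', 'Green'],
    'Type':   ['A', 'B', 'C', 'C', 'A', None, 'A', 'B'],
    'Value1': [13, 3, 5, 8, 2, 56, 40, 1],
    'Value2': [24, 12, 0, 9, -1, 78, 104, -5],
})

# Step 1: keep only rows of Type A or B
ab = df[df['Type'].isin(['A', 'B'])]

# Step 2: keep only Groups that have both an A row and a B row
ab = ab[ab.groupby('Group')['Type'].transform('nunique') == 2]

# Step 3: pivot so A and B become columns, then subtract
wide = ab.pivot(index='Group', columns='Type')
result = (wide.xs('A', axis=1, level='Type')
          - wide.xs('B', axis=1, level='Type')).reset_index()
result['Type'] = 'A-B'
```

The `transform('nunique') == 2` mask is the step the question was unsure about: it counts distinct Types per Group after the A/B filter, so the lone Blue/A row is dropped.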

ahh_real_numbers
  • did you try that? – rgalbo Aug 26 '19 at 17:13
  • What I've been trying to figure out is how to filter out Groups where Types 'A' and 'B' don't both appear. For instance, I can filter out the three rows where Type is neither 'A' nor 'B'. But I would also like to filter out the row where Group='Blue' and Type='A', since there is no corresponding row with Type='B', and that is where I'm stuck. – ahh_real_numbers Aug 26 '19 at 17:22
  • 1
    i would just use `spread()` to create categorical columns of type so that each observation would have a separate column much like the tidyr function `spread()` https://stackoverflow.com/questions/45220904/python-pandas-pivot-how-to-do-a-proper-tidyr-like-spread – rgalbo Aug 26 '19 at 21:58
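For reference, pandas' closest analogue to tidyr's `spread()` is `pivot` (or `unstack`). A minimal sketch of the comment's idea, on a frame trimmed to the A/B rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['Red', 'Red', 'Green', 'Green'],
    'Type': ['A', 'B', 'A', 'B'],
    'Value1': [13, 3, 2, 1],
})

# pivot spreads Type across columns, one row per Group,
# so each observation pair sits on a single row
wide = df.pivot(index='Group', columns='Type', values='Value1')
diff = wide['A'] - wide['B']
```

Once A and B are separate columns, the difference is a plain column subtraction instead of a row-wise `diff()`.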

2 Answers


The following code makes a group for each Type, then subtracts each group's dataframe from every other group's, producing a final dataframe of the differences. Pass your dataframe in as inp_df; the dataframe you want will be final_df:

import pandas as pd

grouped = inp_df.groupby('Type')

# Getting the list of groups:
list_o_groups = list(grouped.groups.keys())

# Going through each group and subtracting the one from the other:
sub_df_dict = {}
for first_idx, first_df in enumerate(list_o_groups):
    for second_idx, second_df in enumerate(list_o_groups):
        if second_idx <= first_idx:
            continue
        key = '%s-%s' % (first_df, second_df)
        sub_df_dict[key] = pd.DataFrame()
        sub_df_dict[key]['Value1'] = grouped.get_group(first_df)['Value1'] - grouped.get_group(second_df)['Value1']
        sub_df_dict[key]['Value2'] = grouped.get_group(first_df)['Value2'] - grouped.get_group(second_df)['Value2']
        sub_df_dict[key]['Type'] = key

# Combining them into one df (DataFrame.append was removed in pandas 2.0):
final_df = pd.concat(sub_df_dict.values())

# Cleaning up the dataframe
final_df.dropna(inplace=True)

[Screenshot: the result of this code on the example dataframe]

*EDIT - added the dropna to clean up the df.

Alexandre Daly
import pandas as pd
from io import StringIO

# read data using string io
data = StringIO("""Group,Type,Value1,Value2
Red,A,13,24
Red,B,3,12
Blue,C,5,0
Red,C,8,9
Green,A,2,-1
Red,None,56,78
Blue,A,40,104
Green,B,1,-5""")
df = pd.read_csv(data)

# create tidyr spread like operation
def spread(df, propcol, valcol):
    indcol = list(df.columns.drop(valcol))
    df = df.set_index(indcol).unstack(propcol).reset_index()
    df.columns = [i[1] if i[0] == valcol else i[0] for i in df.columns]
    return df
df = spread(df, 'Group', 'Type')

# create filter conditions to remove 'C'. can also do the opposite
notBlueC = df['Blue'] != 'C'
notGreenC = df['Green'] != 'C'
notRedC = df['Red'] != 'C'
clean_df = df[notBlueC & notGreenC & notRedC]
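The snippet above stops at the filtering step. One possible continuation from the spread idea to the requested A-B rows (a sketch, not part of the original answer; it uses `pivot` with `dropna` rather than the hand-written `spread`) could be:

```python
import pandas as pd
from io import StringIO

data = StringIO("""Group,Type,Value1,Value2
Red,A,13,24
Red,B,3,12
Blue,C,5,0
Red,C,8,9
Green,A,2,-1
Red,None,56,78
Blue,A,40,104
Green,B,1,-5""")
df = pd.read_csv(data)

# Keep only A/B rows, spread Type across columns, take A - B;
# Groups missing either side (here Blue) come out NaN and are dropped.
ab = df[df['Type'].isin(['A', 'B'])]
wide = ab.pivot(index='Group', columns='Type', values=['Value1', 'Value2'])
result = (wide.xs('A', axis=1, level='Type')
          - wide.xs('B', axis=1, level='Type')).dropna().reset_index()
result['Type'] = 'A-B'
```

This gives one A-B row each for Red and Green, matching the desired output in the question.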
rgalbo