Compare values in 2 columns and output the result in a third column in pandas

Question

My data looks like below, where I am trying to create the column output with the given values.

      a_id b_received c_consumed
  0    sam       soap        oil
  1    sam        oil        NaN
  2    sam      brush       soap
  3  harry        oil      shoes
  4  harry      shoes        oil
  5  alice       beer       eggs
  6  alice      brush      brush
  7  alice       eggs        NaN

The code for producing the dataset is

df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
               'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
               'c_consumed': 'oil NaN soap shoes oil eggs brush NaN'.split()})

I want a new column called Output which looks like this

      a_id b_received c_consumed   output
  0    sam       soap        oil   1
  1    sam        oil        NaN   1
  2    sam      brush       soap   0
  3  harry        oil      shoes   1
  4  harry      shoes        oil   1
  5  alice       beer       eggs   0
  6  alice      brush      brush   1 
  7  alice       eggs        NaN   1

So the search is if sam recieved soap, oil and brush, look for values in column 'consumed' for products he consumed, so if soap was consumed the output will be 1, but since brush wasn't consumed the output is 0.

Similarly for harry, he received oil and shoes, then look for oil and shoes in the consumed column, if oil was consumed, the output is 1.

To make it more clear, the output value corresponds to the first column (received), contingent on the value being present in the second column (consumed).

I tried using this code

   a=[]
   for i in range(len(df.b_received)):
         if any(df.c_consumed == df.b_received[i] ):
              a.append(1)
         else:
              a.append(0)

   df['output']=a

This gives me the output

       a_id b_received c_consumed  output
  0    sam       soap        oil       1
  1    sam        oil        NaN       1
  2    sam      brush       soap       1
  3  harry        oil      shoes       1
  4  harry      shoes        oil       1
  5  alice       beer       eggs       0
  6  alice      brush      brush       1
  7  alice       eggs        NaN       1

The problem is that since sam didn't consume brush, the output should be 0 but the output is 1, since brush was consumed by a different person (alice). I need to make sure that doesn't happen. The output needs to be specific to each person's consumption.

I know this is confusing, so if I have not made myself very clear, please do ask, I will answer your comments.

You should include the code you have written so far to achieve this. It would be helpful to include code, such that someone can copy and paste and create the dataframe, as well. — ilyas patanam, Feb 18 '16 at 22:17
Ok, I have added the code to reproduce the dataset, and yes its important who consumes it for later operations where I want to calculate the likelihood of each user consuming the item received in the future. If that was not the case I'd have simply used the 'find' function — Amit Singh Parihar, Feb 18 '16 at 22:32

JaminSore · Accepted Answer · 2016-02-18T23:29:03.053

The key is pandas.Series.isin() which checks for membership of each element in the calling pandas.Series in the object passed to pandas.Series.isin(). You want to check for membership of each of element in b_received with c_consumed, but only within each group as defined by a_id. When using groupby with apply pandas will index the object by the grouping variable as well as its original index. In your case, you don't need the grouping variable in the index, so you can reset the index back to what it was originally with reset_index using drop=True.

df['output'] = (df.groupby('a_id')
               .apply(lambda x : x['b_received'].isin(x['c_consumed']).astype('i4'))
               .reset_index(level='a_id', drop=True))

Your DataFrame is now ...

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1

Have a look a the documentation for split-apply-combine with pandas for a more thorough explanation.

Thank you for that, I will look into the split apply combine methodolgies, seems like I will have more use for it in the future — Amit Singh Parihar, Feb 18 '16 at 23:43

score 1 · Answer 2 · answered Feb 18 '16 at 23:35

This should work, although the ideal method would be the one given by JaminSore

df['output'] = 0

ctr = 0

for names in df['a_id'].unique():
    for n, row in df.loc[df.a_id == names].iterrows():
        if row['b_received'] in df.loc[df.a_id == names]['c_consumed'].values:
            df.ix[ctr:]['output']=1
            ctr+=1
        else:
            df.ix[ctr:]['output']=0
            ctr+=1

The dataframe now being

    a_id b_received c_consumed  output
0    sam       soap        oil       1
1    sam        oil        NaN       1
2    sam      brush       soap       0
3  harry        oil      shoes       1
4  harry      shoes        oil       1
5  alice       beer       eggs       0
6  alice      brush      brush       1
7  alice       eggs        NaN       1

Compare values in 2 columns and output the result in a third column in pandas

2 Answers2

Linked