2

I have a dataframe that looks like this

df = pd.DataFrame({"jointid": ['ab', 'ac', 'bc'],
                "id": ['a', 'a', 'b'],
                "dog": [0, 0, 0],
                "cat": [1, 1, 1],   
                "id2": ['b', 'c', 'c'],
                "dog2": [0, 1, 1],
                "cat2": [1, 0, 0],
                "common": [np.nan, np.nan, np.nan]})

I need to fill the common column with a dummy variable equal to 1 when both ids on the row have the same animal category animal=animal2. I use dog and cat here, but in the full data set I have 80 categories twice in each row to find these combinations. The desired output for this example is:

df = pd.DataFrame({"jointid": ['ab', 'ac', 'bc'],
                "id": ['a', 'a', 'b'],
                "dog": [0, 0, 0],
                "cat": [1, 1, 1],   
                "id2": ['b', 'c', 'c'],
                "dog2": [0, 1, 1],
                "cat2": [1, 0, 0],
                "common": [1, 0, 0]})

I have tried a lot of different methods, but the hang up seems to be in using the column names as list. Here's the gist of what I've been trying:

net = list(df.loc[:,'dog':'cat'].columns)
for x in net:    
    diff['common'] = np.where(df[x]==df[x+'2'], 1, 0)

Either a get a value of 1 assigned to everything or errors related to the list. Any help is appreciated!

rickwp
  • 23
  • 3

2 Answers2

1

Select by first an last animal, compare by convert columns to numpy array and then test if all Trues per row by DataFrame.all, here is necessary same order of animal columns:

mask = df.loc[:,'dog':'cat'].eq(df.loc[:,'dog2':'cat2'].to_numpy()).all(axis=1)

Another idea if order of columns should be different:

cols = ["cat", "dog"]
mask = df[cols].eq(df[pd.Index(cols) + '2'].to_numpy()).all(axis=1)

df['common'] = np.where(mask,1,0)

Alternative:

df['common'] = mask.view('i1')
print (df)
  jointid id  dog  cat id2  dog2  cat2  common
0      ab  a    0    1   b     0     1       1
1      ac  a    0    1   c     1     0       0
2      bc  b    0    1   c     1     0       0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • The first version in one line worked great. What's the reasoning making it two lines by linking with mask? – rickwp Nov 13 '20 at 06:47
0

After reading link in comment, Use methods in accepted answer

df['new column name'] = df.apply(
    lambda x: 'value if condition is met' if x condition else 'value if condition is not met', axis=1)

in your case

df['common'] = df.apply(
    lambda x: 1 if x["cat"] == x["cat2"] and x["dog"] == x["dog2"] else 0, axis=1)

this will apply the condition across axis=1 which means apply function per row

I did not see that OP has 80 categories. Though I believe you would be able to achieve it in similar fashion.

categories = ["cat", "dog"]
df['common'] = df.apply(
    lambda x: 1 if all([x[category] == x[category+"2"] for category in categories]) else 0, axis=1)

I think this is less readable than accepted answer. But it is easier to scale since you just need to add category to categories

Inyoung Kim 김인영
  • 1,434
  • 1
  • 17
  • 38