
I have two dataframes, df1 and df2, with the same number of rows and columns and the same variables. I'm trying to compare the boolean variable choice in the two dataframes and then use if/else to manipulate the data, but something goes wrong when I try to compare the boolean variable.

Here are samples of my dataframes and my code:

#df1
v_100     choice #boolean
7          True
0          True
7          False
2          True

#df2
v_100     choice #boolean
1          False
2          True
74         True
6          True

def lastTwoTrials_outcome():
     df1 = df.iloc[5::6, :] #df1 and df2 are extracted from the same dataframe first
     df2 = df.iloc[4::6, :]

     if df1['choice'] != df2['choice']:  # if "choice" is different in the two dataframes
         df1['v_100'] = (df1['choice'] + df2['choice']) * 0.5

Here's the error:

if df1['choice'] != df2['choice']:
File "path", line 818, in wrapper
raise ValueError(msg)
ValueError: Can only compare identically-labeled Series objects

I found the same error here, and an answer suggests calling sort_index first, but I don't understand why. Can anyone explain in more detail (if that's the correct solution)?

Thanks!

Lumos

2 Answers


I think you need reset_index so that both dataframes have the same index values, and then compare. To create the new column, it is better to use mask or numpy.where:

Also use | instead of +, because you are working with booleans.

import numpy as np

# Give both frames the same 0..n-1 labels so the Series can be compared
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Option 1: Series.mask
df1['v_100'] = df1['choice'].mask(df1['choice'] != df2['choice'],
                                  (df1['choice'] | df2['choice']) * 0.5)

# Option 2: numpy.where
df1['v_100'] = np.where(df1['choice'] != df2['choice'],
                       (df1['choice'] | df2['choice']) * 0.5,
                       df1['choice'])

Samples:

print (df1)
   v_100  choice
5      7    True
6      0    True
7      7   False
8      2    True

print (df2)
   v_100  choice
4      1   False
5      2    True
6     74    True
7      6    True

df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
print (df1)
   v_100  choice
0      7    True
1      0    True
2      7   False
3      2    True

print (df2)
   v_100  choice
0      1   False
1      2    True
2     74    True
3      6    True

df1['v_100'] = df1['choice'].mask(df1['choice'] != df2['choice'],
                                  (df1['choice'] | df2['choice']) * 0.5)

print (df1)
   v_100  choice
0    0.5    True
1    1.0    True
2    0.5   False
3    1.0    True
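
To illustrate why the comparison fails and why resetting the index helps, here is a minimal self-contained sketch. The index labels are assumed for illustration, chosen to mimic the `iloc[5::6]` and `iloc[4::6]` slices in the question. pandas compares Series label by label, so two Series with different labels cannot be compared directly even when they have the same length. Sorting the index should only help when both Series hold the same labels in a different order. Here the labels themselves differ, so they have to be replaced.

```python
import pandas as pd

# Assumed labels, mimicking df.iloc[5::6] and df.iloc[4::6] from the question
df1 = pd.DataFrame({"v_100": [7, 0, 7, 2],
                    "choice": [True, True, False, True]},
                   index=[5, 11, 17, 23])
df2 = pd.DataFrame({"v_100": [1, 2, 74, 6],
                    "choice": [False, True, True, True]},
                   index=[4, 10, 16, 22])

# Comparing Series whose labels differ raises ValueError
try:
    df1["choice"] != df2["choice"]
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# After reset_index(drop=True) both frames are labelled 0..3,
# so the comparison is done row by row
a = df1.reset_index(drop=True)
b = df2.reset_index(drop=True)
differs = a["choice"] != b["choice"]
print(differs.tolist())
```

The comparison result is itself a boolean Series, which is why it is passed to `mask` or `numpy.where` rather than used in a plain `if`.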
jezrael
  • thanks for the help! I have tried both of them, but still the exact error occurs. – Lumos Jun 27 '17 at 06:14
  • What is `print (df1.index)` and `print (df2.index)` ? What is `print (len(df1.index))` and `print (len(df2.index))` ? – jezrael Jun 27 '17 at 06:15
  • `print (df1.index)` : RangeIndex(start=5, stop=2160, step=6) `print (df2.index)` : RangeIndex(start=4, stop=2160, step=6) `print (len(df1.index))` : 360 `print (len(df2.index))`: 360 . Thanks! – Lumos Jun 27 '17 at 06:33
  • So the problem is that the indexes are different: `start=5` vs `start=4`. Did you try reset_index? – jezrael Jun 27 '17 at 06:34
  • Yes, the prints are after `reset_index`. So why do they still have different starts? I re-checked that they do have the same numbers of rows and columns. – Lumos Jun 27 '17 at 06:40
  • oops, I see it. need `drop=True` – jezrael Jun 27 '17 at 06:42
  • my bad, I did not reassign the var `df1` and `df2`...Sorry....It does work PERFECTLY. Thanks for your help!!! @jezrael – Lumos Jun 27 '17 at 06:59
  • Do you assign output like `df1 = df1.reset_index(drop=True) df2 = df2.reset_index(drop=True)` or only uses `df1.reset_index(drop=True) df2.reset_index(drop=True)` ? – jezrael Jun 27 '17 at 07:00
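
The pitfall discussed in these comments can be sketched as follows: `reset_index` returns a new object by default, so the result has to be assigned back. The small frame and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with non-default labels
frame = pd.DataFrame({"choice": [True, False]}, index=[5, 11])

# Result is discarded, so frame keeps its original labels
frame.reset_index(drop=True)
labels_before = frame.index.tolist()

# Assigning the result back replaces the labels with 0..n-1
frame = frame.reset_index(drop=True)
labels_after = frame.index.tolist()

print(labels_before, labels_after)
```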

The error happens because you compare two pandas.Series objects with different indices. A simple solution would be to compare just the values in the series. Try it:

if df1['choice'].values != df2['choice'].values
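
One caveat, shown as a hedged sketch with made-up Series that mimic the question's data: comparing `.values` ignores the labels and yields an elementwise boolean array, not a single True/False. A plain `if` on that array is ambiguous, so it needs a reduction such as `.any()`, or an elementwise selection such as `numpy.where`.

```python
import numpy as np
import pandas as pd

# Made-up Series with different labels, as in the question
s1 = pd.Series([True, True, False, True], index=[5, 11, 17, 23])
s2 = pd.Series([False, True, True, True], index=[4, 10, 16, 22])

# Position-by-position comparison on the underlying arrays; labels are ignored
diff = s1.values != s2.values

# Reduce to a single boolean before using it in an if statement
any_diff = bool(diff.any())

# Elementwise choice: 0.5 where the values differ, else the original value
result = np.where(diff, 0.5, s1.values.astype(float))

print(diff.tolist(), any_diff, result.tolist())
```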
Poe Dator