1

I have these two dataframes relating to the same data. One of them contains the total of the data and looks like this:

Person ID   word    rt  accuracy    emotional_w
0          CHOQUE   1353    C   True
0          SILLA    434     C   False
0          BRAZO    480     C   False
0          LLUVIA   1091    C   False
1          SOLEDAD  637     C   True
1          INFIERNO 437     I   True
1          MOMENTO  754     C   False

The other one contains the mean rt and the standard deviation rt and then 'desvios_mayores' and 'desvios_menores' which are the numbers I'm interested in:

   Person ID   rt_stdev         rt_mean            desvios_mayores      desvios_menores
0   0       311.200383049439    655.975609756098    1278.37637585498    33.5748436572201
1   1       280.592497402182    971.416666666667    1532.60166147103    410.231671862303
2   2       325.848282375085    928.630952380953    1580.32751713112    276.934387630783

I need to check if the person's rt in each of their words is bigger than desvios_mayores or smaller than devios_menores and, if so, replace that number with their rt_mean.

I wrote this so far but it raises error "ValueError: Can only compare identically-labeled Series objects":

if df_outliers_total['Person ID'] == df['Person ID']:
    if df['rt'] > df_outliers_total['desvios_mayores']:
        df_outliers_total['rt_mean']
    elif df['rt'] < df_outliers_total['desvios_menores']:
        df_outliers_total['rt_mean']

What's a better way of achieving this? Thank you.

dazai
  • 766
  • 4
  • 25

2 Answers2

1

For compare values is necessary left join DataFrame.merge and then set new values in Series.mask by chained both masks by | for bitwise OR:

df1 = df.merge(df_outliers_total, on='Person ID', how='left')

m = (df1['rt'] > df1['desvios_mayores']) | (df1['rt'] < df1['desvios_menores'])
df1['rt'] = df1['rt'].mask(m, df1['rt_mean'])

#for original columns names
df1 = df1.reindex(df.columns, axis=1)
print (df1)
   Person ID      word          rt accuracy  emotional_w
0          0    CHOQUE   655.97561        C         True
1          0     SILLA   434.00000        C        False
2          0     BRAZO   480.00000        C        False
3          0    LLUVIA  1091.00000        C        False
4          1   SOLEDAD   637.00000        C         True
5          1  INFIERNO   437.00000        I         True
6          1   MOMENTO   754.00000        C        False
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Here's another way to accomplish this:


original_cols = df1.columns
df1 = df1.merge(df_outliers_total, on="Person ID", how="left")
df1['rt'] = df1.apply(lambda x: x['rt_mean'] if (x['rt'] > x['desvios_mayores'] or x['rt'] < x['desvios_menores']) else x['rt'], axis=1)

print(df1[original_cols])


  Person ID      word          rt accuracy  emotional_w
0         0    CHOQUE   655.97561        C         True
1         0     SILLA   434.00000        C        False
2         0     BRAZO   480.00000        C        False
3         0    LLUVIA  1091.00000        C        False
4         1   SOLEDAD   637.00000        C         True
5         1  INFIERNO   437.00000        I         True
6         1   MOMENTO   754.00000        C        False

KJDII
  • 851
  • 4
  • 11
  • Why is not good idea use `apply` here - [link](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code/54432584#54432584) - in few words - it is looping under the hood, so slowies like vectorized solutions – jezrael Oct 08 '21 at 13:54
  • 1
    Ah good to know. Thanks for the additional info! – KJDII Oct 08 '21 at 14:30