1

What is the pandaic reasoning behind a way to update a new value in a DataFrame based on other values from the same row?

Given

df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
   a  b  
0  1  2  
1  3  4  

I want

   a  b    c  
0  1  2  NaN  
1  3  4  3.0  

Where the values in column 'c' are set from 'a' if 'b' >= 4.

(1) I tried:

df['c']=df[df['b']>=4]['a']
   a  b    c
0  1  2  NaN
1  3  4  3.0

which worked.

(2) I also tried How can I conditionally update multiple columns in a panda dataframe which sets values from other row values:

df.loc[df['b'] >= 4, 'c'] = df['a']
   a  b    c  
0  1  2  NaN  
1  3  4  3.0  

which worked.

(3) jp also showed a another way:

df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
   a  b    c  
0  1  2  NaN  
1  3  4  3.0  

which worked.

Which of the above is the most pandic? How does loc work?

Answers to the following did not work:

SpeedCoder5
  • 8,188
  • 6
  • 33
  • 34
  • 1
    Are you sure it gave you that error? There is no column 'c' yet when you execute that code. It should raise a KeyError? – ayhan Jun 15 '18 at 22:00
  • thanks. oops. I posted wrong code when assigning regading 'c' not being defnied. correcting above. – SpeedCoder5 Jun 15 '18 at 22:50
  • after reviewing replies below it seems (1) is the most pandiac and (3) is the most performant – SpeedCoder5 Jun 16 '18 at 13:34
  • 1
    When you only have two rows, `timeit` is not a reliable tool to measure the performance because it measures the overhead too. You should compare these methods on large datasets and you'll see that apply is in fact the least performant one. I haven't tested them but in my experience `np.where` outperforms most of the time. And when it comes to the pandas way of doing this, I would just go with `df['c'] = df['a'].where(df['b'] >= 4)`. – ayhan Jun 16 '18 at 13:37
  • 1
    @user2285236 - `df['c'] = df['a'].where(df['b'] >= 4)` works and seems the most natural way of expressing the original intent. thanks for the performance tip on large frames too. `where` is "where" it's at. :D – SpeedCoder5 Jun 16 '18 at 15:05

2 Answers2

1

This will not work because df['c'] is not defined and, if it was, the left is a dataframe while the right is a series:

df[df['b'] >= 4] = df['c']

You cannot assign a series to a dataframe and your assignment is in the wrong direction, so this will never work. However, as you found, the following works:

df.loc[df['b'] >= 4, 'c'] = df['a']

This is because the left and right of this assignment are both series. As an alternative, you can use numpy.where, which you may find more explicit:

df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
jpp
  • 159,742
  • 34
  • 281
  • 339
  • thank you for yoru quick reply. oops. I posted wrong code when assigning regading 'c' not being defnied. correcting above. – SpeedCoder5 Jun 15 '18 at 22:45
1

Other possible way may be to use apply:

df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
print(df)

Result:

   a  b    c
0  1  2  NaN
1  3  4  3.0

Comparing the timings, np.where seems to perform best here among different methods:

%timeit df.loc[df['b'] >= 4, 'c'] = df['a']

1000 loops, best of 3: 1.54 ms per loop

%timeit df['c']=df[df['b']>=4]['a']

1000 loops, best of 3: 869 µs per loop

%timeit df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)

1000 loops, best of 3: 440 µs per loop

%timeit df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)

1000 loops, best of 3: 359 µs per loop

niraj
  • 17,498
  • 4
  • 33
  • 48