Pandaic reasoning behind a way to conditionally update new value from other values in same row in DataFrame

Question

What is the pandaic reasoning behind a way to update a new value in a DataFrame based on other values from the same row?

Given

df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
   a  b  
0  1  2  
1  3  4

I want

   a  b    c  
0  1  2  NaN  
1  3  4  3.0

Where the values in column 'c' are set from 'a' if 'b' >= 4.

(1) I tried:

df['c']=df[df['b']>=4]['a']
   a  b    c
0  1  2  NaN
1  3  4  3.0

which worked.

(2) I also tried How can I conditionally update multiple columns in a panda dataframe which sets values from other row values:

df.loc[df['b'] >= 4, 'c'] = df['a']
   a  b    c  
0  1  2  NaN  
1  3  4  3.0

which worked.

(3) jp also showed a another way:

df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
   a  b    c  
0  1  2  NaN  
1  3  4  3.0

which worked.

Which of the above is the most pandic? How does loc work?

Answers to the following did not work:

Update row values where certain condition is met in pandas: sets values from a literal
How to conditionally update DataFrame column in Pandas: sets values from a literal

Are you sure it gave you that error? There is no column 'c' yet when you execute that code. It should raise a KeyError? — ayhan, Jun 15 '18 at 22:00
thanks. oops. I posted wrong code when assigning regading 'c' not being defnied. correcting above. — SpeedCoder5, Jun 15 '18 at 22:50
after reviewing replies below it seems (1) is the most pandiac and (3) is the most performant — SpeedCoder5, Jun 16 '18 at 13:34
When you only have two rows, `timeit` is not a reliable tool to measure the performance because it measures the overhead too. You should compare these methods on large datasets and you'll see that apply is in fact the least performant one. I haven't tested them but in my experience `np.where` outperforms most of the time. And when it comes to the pandas way of doing this, I would just go with `df['c'] = df['a'].where(df['b'] >= 4)`. — ayhan, Jun 16 '18 at 13:37
@user2285236 - `df['c'] = df['a'].where(df['b'] >= 4)` works and seems the most natural way of expressing the original intent. thanks for the performance tip on large frames too. `where` is "where" it's at. :D — SpeedCoder5, Jun 16 '18 at 15:05

score 1 · Answer 1 · answered Jun 15 '18 at 21:57

This will not work because df['c'] is not defined and, if it was, the left is a dataframe while the right is a series:

df[df['b'] >= 4] = df['c']

You cannot assign a series to a dataframe and your assignment is in the wrong direction, so this will never work. However, as you found, the following works:

df.loc[df['b'] >= 4, 'c'] = df['a']

This is because the left and right of this assignment are both series. As an alternative, you can use numpy.where, which you may find more explicit:

df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)

thank you for yoru quick reply. oops. I posted wrong code when assigning regading 'c' not being defnied. correcting above. — SpeedCoder5, Jun 15 '18 at 22:45

score 1 · Accepted Answer · answered Jun 16 '18 at 00:25

Other possible way may be to use apply:

df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
print(df)

Result:

   a  b    c
0  1  2  NaN
1  3  4  3.0

Comparing the timings, np.where seems to perform best here among different methods:

%timeit df.loc[df['b'] >= 4, 'c'] = df['a']

1000 loops, best of 3: 1.54 ms per loop

%timeit df['c']=df[df['b']>=4]['a']

1000 loops, best of 3: 869 µs per loop

%timeit df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)

1000 loops, best of 3: 440 µs per loop

%timeit df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)

1000 loops, best of 3: 359 µs per loop

Pandaic reasoning behind a way to conditionally update new value from other values in same row in DataFrame

2 Answers2