
I have a pandas dataframe that looks like this:

   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8

I'd like to create a new column based on the used column, so that the df looks like this:

   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial
  • Create a new alert column based on the used column:
  • If used is 1.0, alert should be Full.
  • If used is 0.0, alert should be Empty.
  • Otherwise, alert should be Partial.

What's the best way to do that?

user3786999
  • possible duplicate of [Pandas conditional creation of a series/dataframe column](http://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column) – chrisb Nov 20 '14 at 14:17

6 Answers


You can define a function that returns the different states "Full", "Partial", "Empty", etc., and then use df.apply to apply that function to each row. Note that you have to pass the keyword argument axis=1 to ensure the function is applied row-wise.

import pandas as pd

def alert(row):
    if row['used'] == 1.0:
        return 'Full'
    elif row['used'] == 0.0:
        return 'Empty'
    elif 0.0 < row['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

#    portion  used    alert
# 0        1   1.0     Full
# 1        2   0.3  Partial
# 2        3   0.0    Empty
# 3        4   0.8  Partial
Ffisegydd
    Great example. To make the code a little bit clearer (and since you are using `axis=1`), you could re-name the parameter `c` to `row`, that way is really obvious that you have access to all values of the row in the function. – Onema Mar 22 '19 at 22:55

Alternatively you could do:

import pandas as pd
import numpy as np
df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})

%%timeit
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
df.loc[(df['used'] > 0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'

Which gives the same output but runs about 100 times faster on 10000 rows:

100 loops, best of 3: 2.91 ms per loop

Then using apply:

%timeit df['alert'] = df.apply(alert, axis=1)

1 loops, best of 3: 287 ms per loop

I guess the choice depends on how big your dataframe is.

Primer
  • Question on the %timeit: if the first did 100 loops @2.91 seconds, does that mean total time is 291 ms, slightly longer than the 287ms time for the alert function to complete the 1 loop? – Nate Jun 03 '20 at 17:25
    1 loop in this case is running 3 lines of code after `%%timeit`. Number of loops (100 in this case) are chosen automatically by the `timeit` program to give more robust measure within some sensible "timeout" (i.e. if running 1 loop is longer than this "timeout" there will be only 1 loop, like in case of "using apply"). Results of `timeit` should be compared on a "per 1 loop" basis. That's why there is "runs about 100 times faster" phrase: 1 loop that took 2.91 ms is about 100 times faster than 1 loop that took 287 ms. – Primer Jun 05 '20 at 11:37

Use np.where, which is usually fast:

In [845]: df['alert'] = np.where(df.used == 1, 'Full', 
                                 np.where(df.used == 0, 'Empty', 'Partial'))

In [846]: df
Out[846]:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial

Timings

In [848]: df.shape
Out[848]: (100000, 3)

In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
100 loops, best of 3: 6.17 ms per loop

In [850]: %%timeit
     ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
     ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
     ...: df.loc[(df['used'] > 0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
     ...:
10 loops, best of 3: 21.9 ms per loop

In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
1 loop, best of 3: 2.79 s per loop
Zero

Use np.select() for >2 conditions

Given >2 conditions like OP's example, np.select() is much cleaner than nesting multiple levels of np.where() (and is just as fast).

  • Either define the conditions/choices as two lists (paired element-wise) with an optional default value ("else" case):

    conditions = [
        df.used.eq(0),
        df.used.eq(1),
    ]
    choices = [
        'Empty',
        'Full',
    ]
    df['alert'] = np.select(conditions, choices, default='Partial')
    
  • Or define the conditions/choices as a dictionary for maintainability (easier to keep them paired properly when making additions/revisions):

    conditions = {
        'Empty': df.used.eq(0),
        'Full': df.used.eq(1),
    }
    df['alert'] = np.select(conditions.values(), conditions.keys(), default='Partial')
    

np.select() is very fast

Timings with 5 conditions (full, high, medium, low, empty):


df = pd.DataFrame({'used': np.random.randint(10 + 1, size=10)}).div(10)
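The timing chart from the original page was an image and doesn't survive as text. As a minimal sketch of what a 5-condition np.select call can look like, the bucket names come from the chart caption but the thresholds below are assumptions for illustration (a deterministic frame is used instead of the random one above so the output is checkable):

```python
import numpy as np
import pandas as pd

# Deterministic frame: 'used' takes the values 0.0, 0.1, ..., 1.0
df = pd.DataFrame({'used': np.arange(11)}).div(10)

# Hypothetical thresholds for the 5 buckets (the original chart does not specify them)
conditions = {
    'full':   df.used.eq(1),
    'high':   df.used.ge(0.75) & df.used.lt(1),
    'medium': df.used.ge(0.25) & df.used.lt(0.75),
    'low':    df.used.gt(0) & df.used.lt(0.25),
    'empty':  df.used.eq(0),
}
# np.select picks the choice for the first condition that is True, row by row
df['alert'] = np.select(list(conditions.values()), list(conditions.keys()))
```

As in the dictionary variant above, keeping each label next to its condition makes it easy to add or reorder buckets without mispairing the two lists.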

tdy

Can't comment so making a new answer: Improving on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function passed to .apply() easier to manage:

import pandas as pd

def alert(c):
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

Depending on the use case, you might like to define the dict outside of the function definition as well.
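If the dictionary lives outside the function anyway, the same lookup can be done without .apply() at all. A sketch using Series.map, which returns NaN for values missing from the dict, so fillna supplies the "Partial" default:

```python
import pandas as pd

df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Map the exact matches, then fill every unmatched value with 'Partial'
mapping = {1.0: 'Full', 0.0: 'Empty'}
df['alert'] = df['used'].map(mapping).fillna('Partial')
```

This stays vectorized, so it avoids the per-row Python function call overhead that the timings above attribute to .apply().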

Hansang
df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))

This would appear to work, but it raises `ValueError: either both or neither of x and y should be given`: the inner `np.where` has a condition and an x value but no y branch, and `np.where` requires either both branches or neither.
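A sketch of one way to satisfy that requirement, assuming rows where Public is neither 1 nor 2 should be left missing (the column name and sample values here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Public': [1, 2, 2, 1, 3]})

# np.where needs both the x and y branches; give the inner call a default
df['TaxStatus'] = np.where(df.Public == 1, True,
                           np.where(df.Public == 2, False, np.nan))

# Note: because NaN forces a numeric dtype, the result is a float column
# (1.0 / 0.0 / NaN) rather than a boolean one.
```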

manager_matt