Pandas: To create a new column based on two column values

Question

Learning column comparison. How to create a new column based on two columns?

I can do two conditions fruit or vegetable. But for the third condition, couldn’t do. :(

df
    basket1     basket2
0   fruit       fruit
1   vegetable   vegetable 
2   vegetable   both
3   fruit       both

The result

Newdf

    basket1    basket2    total
0   fruit      fruit      fruit
1   vegetable  vegetable  vegetable  
2   vegetable  both       Unknown
3   fruit      both      fruit

Thanks a lot for your help!

It's unclear what you're asking or trying to do. Did you try some code that got this result? Do you intend for the logic to produce `Unknown` when `vegetable` and `both` are in the other columns? — Conner, Nov 21 '18 at 04:54
Yes i want to separate vegetable both and fruit both to two separate words. — Pythonista, Nov 21 '18 at 04:58

aydow · Accepted Answer · 2021-07-09T05:30:01.910

Update

Revisiting this, DataFrame.apply is slow AF. Let's look at some other options and then compare.

Other options to `DataFrame.apply`

numpy.where

This method can be applied when we only have two options. In your case, this is true as we return df.a when df.a == df.b or df.a == 'fruit' and df.b == 'both'. The syntax is np.where(condition, value_if_true, value_if_false).

In [42]: df['np_where'] = np.where(
    ...:     ((df.a == df.b) | ((df.a == 'fruit') & (df.b == 'both'))),
    ...:     df.a,
    ...:     'Unknown'
    ...: )

numpy.select

You would use this option if you had multiple conditions. The syntax for this is np.select(condition, values, default) where default is an optional parameter.

In [43]: conditions = df.a == df.b, (df.a == 'fruit') & (df.b == 'both')

In [44]: choices = df['a'], df['a']

In [45]: df['np_select'] = np.select(conditions, choices, default='Unknown')

Note that for the purposes of demonstration I have created two conditions even if the outcomes yield the same result.

Comparing the options

As you can see, all three methods have the same outcome.

In [47]: df
Out[47]:
           a          b   np_where  np_select   df_apply
0      fruit      fruit      fruit      fruit      fruit
1  vegetable  vegetable  vegetable  vegetable  vegetable
2  vegetable       both    Unknown    Unknown    Unknown
3      fruit       both      fruit      fruit      fruit

But how do they compare in terms of speed? To check this, let's create a newer, larger DataFrame. We're doing this to see how our options perform with larger amounts of data.

In [48]: df_large = pd.DataFrame({
    ...:     'a': np.random.choice(['fruit', 'vegetable'], size=1_000_000),
    ...:     'b': np.random.choice(['fruit', 'vegetable', 'both'], size=1_000_000)
    ...: })

In [49]: %timeit df_large['np_where'] = np.where(((df_large.a == df_large.b) | ((df_large.a == 'fruit')
    ...:  & (df_large.b == 'both'))), df_large.a, 'Unknown')
379 ms ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit df_large['np_select'] = np.select(((df_large.a == df_large.b), ((df_large.a == 'fruit'
    ...: ) & (df_large.b == 'both'))), (df_large.a, df_large.a), default='Unknown')
580 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit df_large['df_apply'] = df_large.apply(total, axis=1)
40.5 s ± 6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Wow! As you can see DataFrame.apply is much slower than our other two options and np.where edges out np.select.

Conclusions

Use np.where if you have only two choices
Use np.select if you have multiple choices
Don't use DataFrame.apply (especially for large data sets)!

Resources

https://stackoverflow.com/a/19913845/5491375
https://stackoverflow.com/a/31173785/5491375 (has some fast options if you're only interested in a single column)

Create your own function and use DataFrame.apply

In [104]: def total(r):
     ...:     if r.a == r.b:
     ...:         return r.a
     ...:     elif (r.a == 'fruit') and (r.b == 'both'):
     ...:         return r.a
     ...:     return 'Unknown'
     ...:

In [105]: df = pd.DataFrame({'a': ['fruit', 'vegetable', 'vegetable', 'fruit'], 'b': ['fruit', 'vegetable', 'both', 'both']})

In [106]: df
Out[106]:
           a          b
0      fruit      fruit
1  vegetable  vegetable
2  vegetable       both
3      fruit       both

In [107]: df['total'] = df.apply(total, axis=1)

In [108]: df
Out[108]:
           a          b      total
0      fruit      fruit      fruit
1  vegetable  vegetable  vegetable
2  vegetable       both    Unknown
3      fruit       both      fruit

score 3 · Answer 2 · answered Nov 21 '18 at 06:22

df["total"] = df.apply(lambda x: x.a if (x.a == x.b) or ((x.a == 'fruit') and (x.b == 'both')) else 'Unkonwn', axis = 1)

Output

           a          b      total
0      fruit      fruit      fruit
1  vegetable  vegetable  vegetable
2  vegetable       both    Unkonwn
3      fruit       both      fruit

Sociopath · Answer 3 · 2018-11-21T05:45:48.763

1

Here's solution using np.select

df['total'] = np.select([df['a']==df['b'], (df['a']=='fruit')&(df['b']=='both')], [df['a'], 'fruit'], 'Unkown')

Output:

    a          b           total
0   fruit      fruit       fruit
1   vegetable  vegetable   vegetable
2   vegetable  both        Unknown 
3   fruit      both        fruit

edited Nov 21 '18 at 05:45

answered Nov 21 '18 at 05:38

Sociopath

13,068
19
47
75

score 0 · Answer 4 · answered Nov 21 '18 at 05:13

data.csv

basket1,basket2
fruit,fruit
vegetable,vegetable
vegetable,both
fruit,both

code.py

import pandas as pd

df = pd.read_csv('data.csv')
for i, r in df.iterrows():
    if df.at[i, 'basket1'] == df.at[i, 'basket2']:
        df.at[i, 'total'] = df.at[i, 'basket1']
    else:
        df.at[i, 'total'] = 'something else'

Output:

     basket1    basket2           total
0      fruit      fruit           fruit
1  vegetable  vegetable       vegetable
2  vegetable       both  something else
3      fruit       both  something else

Pandas: To create a new column based on two column values

4 Answers4

Other options to `DataFrame.apply`

Comparing the options

Conclusions

Resources

Linked

Pandas: To create a new column based on two column values

4 Answers4

Other options to DataFrame.apply

Comparing the options

Conclusions

Resources

Linked

Other options to `DataFrame.apply`