164

I have a DataFrame df:

    A    B
a   2    2 
b   3    1
c   1    3

I want to create a new column based on the following criteria:

if row A == B: 0

if rowA > B: 1

if row A < B: -1

so given the above table, it should be:

    A    B    C
a   2    2    0
b   3    1    1
c   1    3   -1 

For typical if else cases I do np.where(df.A > df.B, 1, -1), does pandas provide a special syntax for solving my problem with one step (without the necessity of creating 3 new columns and then combining the result)?

nutship
  • 4,624
  • 13
  • 47
  • 64
  • You could just define a function and pass this to `apply` and set `axis=1` would work, not sure I can think of an operation that would give you what you want – EdChum Feb 11 '14 at 12:58
  • Your solution imply creating 3 columns and combining them into 1 column, or you have something different in mind? – nutship Feb 11 '14 at 13:00
  • You keep saying "creating 3 columns", but I'm not sure what you're referring to. – DSM Feb 11 '14 at 13:01
  • 1
    @DSM has answered this question but I meant something like `df['C']=df.apply(myFunc(row), axis=1)` where myFunc does what you want, this does not involve creating '3 columns' – EdChum Feb 11 '14 at 13:05
  • 1
    Possible duplicate of [Pandas conditional creation of a series/dataframe column](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column) – Georgy May 02 '18 at 17:34

7 Answers7

251

To formalize some of the approaches laid out above:

Create a function that operates on the rows of your dataframe like so:

def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val

Then apply it to your dataframe passing in the axis=1 option:

In [1]: df['C'] = df.apply(f, axis=1)

In [2]: df
Out[2]:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1

Of course, this is not vectorized so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable. Especially coming from a SAS background.

Edit

Here is the vectorized version

df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
    df['A'] >  df['B'], 1, -1)) 
Zelazny7
  • 39,946
  • 18
  • 70
  • 84
  • 5
    What if I want to pass another parameter along with row in the function? If I do, it says row not defined.. – prashanth manohar Mar 22 '17 at 10:29
  • 3
    You have to use the `args` parameter of the `.apply` function: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html – Zelazny7 Mar 22 '17 at 13:31
  • 1
    I'm an old SAS user learning Python, and there's definitely a learning curve! :-) For example, the above code could be written in SAS as: `data df; set df; if A=B then C=0; else if A>B then C=1; else C=-1; run;` Very elegant and simple. – RobertF Feb 22 '19 at 16:24
  • thanks for the answer. However, I could not understand why `axis=1` is required. Without passing `axis=1` the row had only 1st column. – Jitendra Apr 26 '19 at 13:19
  • @Zelazny7 could you please give a vectorized version? – Averell Nov 11 '20 at 04:39
  • For the unvectorized, your function contains arguments - and so it would be: df['c'] = df(lambda row: f(row), axis=1) – Deskjokey Jan 12 '22 at 06:48
89
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1

Easy to solve using indexing. The first line of code reads like so, if column A is equal to column B then create and set column C equal to 0.

Brian
  • 2,163
  • 1
  • 14
  • 26
18

For this particular relationship, you could use np.sign:

>>> df["C"] = np.sign(df.A - df.B)
>>> df
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
DSM
  • 342,061
  • 65
  • 592
  • 494
12

When you have multiple if conditions, numpy.select is the way to go:

In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]

In [4100]: df['C'] = np.select(conditions, choices)

In [4101]: df
Out[4101]: 
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
Mayank Porwal
  • 33,470
  • 8
  • 37
  • 58
8

enter image description here

Lets say above one is your original dataframe and you want to add a new column 'old'

If age greater than 50 then we consider as older=yes otherwise False

step 1: Get the indexes of rows whose age greater than 50

row_indexes=df[df['age']>=50].index

step 2: Using .loc we can assign a new value to column

df.loc[row_indexes,'elderly']="yes"

same for age below less than 50

row_indexes=df[df['age']<50].index

df[row_indexes,'elderly']="no"

Ravi
  • 2,778
  • 2
  • 20
  • 32
4

You can use the method mask:

df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
0

a one liner solution with list comprehension and zip() is also

df.loc[:,'C']= [0 if d1==d2 else 1 if d1>d2 else -1 for d1,d2 in zip(df.A,df.B)]

which returns the desired output

Stefano Verugi
  • 101
  • 1
  • 5