Compare two columns using pandas

Question

Using this as a starting point:

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

which looks like

  one  two three
0  10  1.2   4.2
1  15   70  0.03
2   8    5     0

I want to use something like an if statement within pandas.

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

Basically, create a new column by checking each row via the if statement.

The docs say to use .all but there is no example...

unutbu · Accepted Answer · 2014-12-15T11:44:27.400

You could use np.where. If cond is a boolean array, and A and B are arrays, then

C = np.where(cond, A, B)

defines C to be equal to A where cond is True, and B where cond is False.

import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three']),
                     df['one'], np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

If you have more than one condition, then you could use np.select instead. For example, if you wish df['que'] to equal df['two'] when df['one'] < df['two'], then

conditions = [
    (df['one'] >= df['two']) & (df['one'] <= df['three']), 
    df['one'] < df['two']]

choices = [df['one'], df['two']]

df['que'] = np.select(conditions, choices, default=np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN

If we can assume that df['one'] >= df['two'] when df['one'] < df['two'] is False, then the conditions and choices could be simplified to

conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]

choices = [df['two'], df['one']]

(The assumption may not be true if df['one'] or df['two'] contain NaNs.)
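
A minimal sketch of why the order of conditions matters in np.select, assuming a small numeric DataFrame invented for illustration (its values differ from the question's data). The first condition that is True for a row decides which choice is used.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, not the question's string DataFrame.
df = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                   'two': [1.2, 70.0, 5.0],
                   'three': [40.2, 0.03, 0.0]})

# np.select checks the conditions in order; the first True one wins.
conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]
choices = [df['two'], df['one']]

# Rows where no condition holds receive the default.
df['que'] = np.select(conditions, choices, default=np.nan)
print(df)
```

Row 1 takes df['two'] because the first condition already matches, and row 2 matches neither condition, so it falls through to the default.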

Note that

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:

df2 = df.astype(float)

This changes the results, however, since strings compare character-by-character, while floats are compared numerically.

In [61]: '10' <= '4.2'
Out[61]: True

In [62]: 10 <= 4.2
Out[62]: False
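
To make the difference concrete, here is a short sketch that evaluates the same condition on the original string DataFrame and on the float-converted copy. The variable names cond_str and cond_num are only illustrative.

```python
import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df2 = df.astype(float)

# String columns: compared character by character.
cond_str = (df['one'] >= df['two']) & (df['one'] <= df['three'])

# Float columns: compared numerically.
cond_num = (df2['one'] >= df2['two']) & (df2['one'] <= df2['three'])
df2['que'] = np.where(cond_num, df2['one'], np.nan)

print(cond_str.tolist())
print(cond_num.tolist())
```

With strings the first row satisfies the condition, but numerically no row does, so every value of df2['que'] is NaN.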

score 127 · Answer 2 · edited Jul 19 '23 at 12:56

You can use .equals to compare 2 columns:

df['col1'].equals(df['col2'])

or to compare 2 dataframes:

df1.equals(df2)

If they're equal, that statement will return True, else False.
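
As a small illustration, assuming made-up columns col1 and col2 that contain a NaN in the same position: .equals returns a single boolean and treats NaNs in matching positions as equal, whereas an element-wise == does not.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a NaN in the same row of both columns.
df = pd.DataFrame({'col1': [1.0, np.nan, 3.0],
                   'col2': [1.0, np.nan, 3.0]})

# One boolean for the whole column; NaNs in the same location count as equal.
same = df['col1'].equals(df['col2'])

# Element-wise comparison: NaN == NaN is False, so .all() is False.
elementwise_all = (df['col1'] == df['col2']).all()

print(same, elementwise_all)
```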

answered Jul 25 '16 at 21:35 by ccook5760 · edited Jul 19 '23 at 12:56 by blackraven

score 34 · Answer 3 · edited Aug 24 '20 at 09:48

You could use apply() and do something like this

df['que'] = df.apply(lambda x : x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)

or if you prefer not to use a lambda

def que(x):
    if x['one'] >= x['two'] and x['one'] <= x['three']:
        return x['one']
    return ''
df['que'] = df.apply(que, axis=1)
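
One design choice worth noting: returning '' for non-matching rows mixes strings with numbers. The sketch below, which assumes a hypothetical numeric DataFrame, compares that with returning np.nan instead. The column names que_nan and que_blank are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                   'two': [1.2, 70.0, 5.0],
                   'three': [40.2, 0.03, 0.0]})

def que(row):
    # Plain Python comparisons are fine here: each row holds scalars.
    if row['two'] <= row['one'] <= row['three']:
        return row['one']
    return np.nan

df['que_nan'] = df.apply(que, axis=1)
df['que_blank'] = df.apply(
    lambda row: row['one'] if row['two'] <= row['one'] <= row['three'] else '',
    axis=1)

print(df.dtypes)
```

The NaN version keeps a float column, while the empty-string version produces an object column, which is less convenient for later arithmetic.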

answered Dec 14 '14 at 22:46 by Bob Haffner · edited Aug 24 '20 at 09:48 by divykj

Alex Riley · Answer 4 · 2014-12-14T22:54:11.100

One way is to use a Boolean series to index the column df['one']. This gives you a new column where the rows marked True keep the value from the same row of df['one'] and the rows marked False are NaN.

The Boolean series is just given by your if statement (although it is necessary to use & instead of and):

>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

If you want the NaN values to be replaced by other values, you can use the fillna method on the new column que. I've used 0 instead of the empty string here:

>>> df['que'] = df['que'].fillna(0)
>>> df
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03   0
2   8    5     0   0
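
The reason the non-matching rows become NaN is index alignment on assignment. A brief sketch, assuming hypothetical numeric data and an illustrative variable named subset:

```python
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                   'two': [1.2, 70.0, 5.0],
                   'three': [40.2, 0.03, 0.0]})

mask = (df['one'] >= df['two']) & (df['one'] <= df['three'])

# Boolean indexing keeps only the matching rows, with their original index.
subset = df['one'][mask]

# Assignment aligns on the index; rows missing from subset become NaN.
df['que'] = subset

print(subset.index.tolist())
print(df)
```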

score 9 · Answer 5 · answered Dec 14 '14 at 22:47

Wrap each individual condition in parentheses, and then use the & operator to combine the conditions:

df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']

You can fill the non-matching rows by just using ~ (the "not" operator) to invert the match:

df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''

You need to use & and ~ rather than and and not because the & and ~ operators work element-by-element.

The final result:

df
Out[8]: 
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03    
2   8    5     0
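
To illustrate the point about & versus and, here is a minimal sketch on hypothetical numeric data. Combining two Boolean Series with and asks Python for a single truth value of a whole Series, which pandas refuses to provide.

```python
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                   'two': [1.2, 70.0, 5.0],
                   'three': [40.2, 0.03, 0.0]})

try:
    # `and` needs one True/False for the entire Series, so this raises.
    result = (df['one'] >= df['two']) and (df['one'] <= df['three'])
except ValueError as exc:
    message = str(exc)
    print(message)

# `&` combines the two conditions row by row instead.
mask = (df['one'] >= df['two']) & (df['one'] <= df['three'])
print(mask.tolist())
```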

score 7 · Answer 6 · answered Aug 13 '21 at 03:56

I'd like to add this answer for those who are trying to compare the equality of values in two columns that have NaN values, and get False when both values are NaN. By definition, NaN != NaN (See: numpy.isnan(value) not the same as value == numpy.nan?).

If you want a comparison of two NaN values to return True, you can use:

df['compare'] = (df["col_1"] == df["col_2"]) | (df["col_1"].isna() & df["col_2"].isna())
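
A short sketch of the difference, assuming made-up values for col_1 and col_2. The intermediate name plain is only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has NaN in both columns, row 3 in only one.
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0, np.nan],
                   'col_2': [1.0, np.nan, 4.0, 2.0]})

# Plain equality: any comparison involving NaN is False.
plain = df['col_1'] == df['col_2']

# NaN-aware equality: also True when both values are NaN.
df['compare'] = plain | (df['col_1'].isna() & df['col_2'].isna())

print(plain.tolist())
print(df['compare'].tolist())
```

Only the row where both values are NaN changes from False to True. A NaN paired with a real number still compares as unequal.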

score 4 · Answer 7 · answered Dec 03 '20 at 16:17

Use a lambda expression to select the rows where the two columns differ:

df[df.apply(lambda x: x['col1'] != x['col2'], axis = 1)]
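
For comparison, the same filter can be written as a vectorized mask, which avoids calling a Python function once per row. The sketch below assumes small made-up columns col1 and col2 and checks that both forms select the same rows.

```python
import pandas as pd

# Hypothetical data: only the middle row differs.
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [1, 5, 3]})

# Row-wise lambda, as in the answer above.
via_apply = df[df.apply(lambda x: x['col1'] != x['col2'], axis=1)]

# Vectorized equivalent using an element-wise comparison.
via_mask = df[df['col1'] != df['col2']]

print(via_mask)
```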

answered Dec 03 '20 at 16:17 by aze45sq6d

score 3 · Answer 8 · edited Aug 16 '19 at 10:09

Use np.select if you have multiple conditions to check against the dataframe and want to output a specific choice for each in a new column:

conditions = [(condition1), (condition2)]
choices = ["choice1", "choice2"]

df["new column"] = np.select(conditions, choices, default=default_value)

Note: the number of conditions and the number of choices must match. If two different conditions share the same choice, repeat that choice in the list. Here condition1, condition2 and default_value are placeholders for your own expressions.
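
A concrete sketch of this pattern with string labels as choices. The data, the labels and the column name verdict are all invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                   'two': [1.2, 70.0, 8.0]})

# One choice per condition, in the same order.
conditions = [df['one'] > df['two'], df['one'] < df['two']]
choices = ['one is larger', 'two is larger']

# Rows matching no condition get the default label.
df['verdict'] = np.select(conditions, choices, default='equal')

print(df)
```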

answered Aug 16 '19 at 09:41 by psn1997 · edited Aug 16 '19 at 10:09 by Dharman

Mykola Zotko · score 2 · Answer 9 · 2023-04-03T08:16:58.977

You can use the method where:

df['que'] = df['one'].where((df['one'] >= df['two']) & (df['one'] <= df['three']))

or the method eval:

df['que'] = df.loc[df.eval('(one >= two) & (one <= three)'), 'one']

Result:

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN
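
A small sketch of both methods on hypothetical numeric data. It also shows the optional second argument of where, which supplies a replacement value instead of NaN. The value -1.0 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                   'two': [1.2, 70.0, 5.0],
                   'three': [40.2, 0.03, 0.0]})

cond = (df['one'] >= df['two']) & (df['one'] <= df['three'])

# Keep values where cond is True; NaN elsewhere.
kept = df['one'].where(cond)

# Same, but fill non-matching rows with a chosen value.
filled = df['one'].where(cond, -1.0)

# The same condition written as an eval expression.
via_eval = df.eval('(one >= two) & (one <= three)')

print(kept.tolist(), filled.tolist(), via_eval.tolist())
```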

answered Nov 07 '21 at 20:34 by Mykola Zotko · edited Apr 03 '23 at 08:16

score 0 · Answer 10 · answered Oct 12 '18 at 19:28

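I think the closest to the OP's intuition is an inline if statement. A plain if cannot be applied to whole columns (using and on two Series raises an error), so evaluate it row by row, for example in a list comprehension:

df['que'] = [one if (one >= two) and (one <= three) else np.nan
             for one, two, three in zip(df['one'], df['two'], df['three'])]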

answered Oct 12 '18 at 19:28 by Nic Scozzaro

score 0 · Answer 11 · answered Oct 28 '22 at 00:04

If you're here to compare values in two dataframe columns, you can use eq():

df['one'].eq(df['two'])

or eval()

df.eval("one == two")

and if you want to reduce it to a single boolean, call all() on the result:

df['one'].eq(df['two']).all()
# or
df.eval("one == two").all()

This is a more "robust" check than equals() because for equals() to return True, the column dtypes must match as well. So if one column is dtype int and the other is dtype float, equals() would return False even if the values are the same, whereas eq().all()/eval().all() simply compares the columns element-wise.
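
A minimal sketch of the dtype point, assuming one int column and one float column that hold the same values. The names strict and loose are illustrative.

```python
import pandas as pd

# Same values, different dtypes (int64 versus float64).
df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [1.0, 2.0, 3.0]})

# equals() also requires matching dtypes, so this is False.
strict = df['one'].equals(df['two'])

# eq().all() only compares the values element-wise, so this is True.
loose = df['one'].eq(df['two']).all()

print(strict, loose)
```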

If your columns include NaN values and two NaNs in the same row should count as equal, then use the following (which leverages the fact that NaN != NaN):

df.eval("(one == two) or (one != one and two != two)").all()

For OP's specific question, since the pattern is "A <= B and B <= C", you can use between(), which is inclusive on both ends by default:

cond = df['one'].between(df['two'], df['three'])
df['que'] = np.where(cond, df['one'], np.nan)
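
A short sketch on hypothetical numeric data showing between() with column bounds, including a row that sits exactly on the lower bound. The inclusive='neither' call assumes a pandas version that accepts string values for that parameter (1.3 or later).

```python
import pandas as pd

# Hypothetical data; in the last row 'one' equals the lower bound.
df = pd.DataFrame({'one': [10.0, 15.0, 8.0],
                   'two': [1.2, 70.0, 8.0],
                   'three': [40.2, 0.03, 9.0]})

# Inclusive on both ends by default: two <= one <= three.
cond = df['one'].between(df['two'], df['three'])

# The equivalent explicit comparison.
manual = (df['one'] >= df['two']) & (df['one'] <= df['three'])

# Strict version: two < one < three.
strict = df['one'].between(df['two'], df['three'], inclusive='neither')

print(cond.tolist(), strict.tolist())
```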

score 0 · Answer 12 · answered Jun 15 '23 at 08:13

To elaborate on @ccook5760's answer

You can use .equals for columns or entire dataframes.

df['col1'].equals(df['col2'])

If they're equal, that statement will return True, else False.

For the equality to be verified, the columns must contain the same values in the same order and their indexes must be identical too.

If you wanted to check equality of two columns from two different dataframes where order of values is not important and may vary, you can sort the values first.

It is also important to reset the index of the series so that the equality can be verified based only on the values.

Here is one way to do it :

df1['col1'].sort_values().reset_index(drop=True).equals(df2['col2'].sort_values().reset_index(drop=True))

Same method, in a more readable way :

s1 = df1['col1'].sort_values().reset_index(drop=True)
s2 = df2['col2'].sort_values().reset_index(drop=True)
s1.equals(s2)
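
For instance, with two small made-up dataframes that hold the same values in a different order:

```python
import pandas as pd

# Hypothetical data: same values, different order.
df1 = pd.DataFrame({'col1': [3, 1, 2]})
df2 = pd.DataFrame({'col2': [1, 2, 3]})

# Direct comparison fails because the order differs.
direct = df1['col1'].equals(df2['col2'])

# Sort, then reset the index so only the values are compared.
s1 = df1['col1'].sort_values().reset_index(drop=True)
s2 = df2['col2'].sort_values().reset_index(drop=True)
after_sort = s1.equals(s2)

print(direct, after_sort)
```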

score 0 · Answer 13 · answered Jun 26 '23 at 06:16

If your columns contain the same values in the same order and they are numeric, use the equality operator; otherwise use the .equals method.

For example, col1 contains 58.1.2, 29.2.4 and col2 contains 58.1.2, 28.2.4.

syntax: mydf['processedRecords'] = (mydf['col1'] == mydf['col2'])

Rows that match get True, and rows that don't match get False.

If the values are non-numeric, go for the .equals method. Note that .equals returns a single boolean for the whole column, so every row of processedRecords receives the same value.

syntax: mydf['processedRecords'] = mydf['col1'].equals(mydf['col2'])
