46

How to apply conditional logic to a Pandas DataFrame.

See the DataFrame shown below:

   data desired_output
0     1          False
1     2          False
2     3           True
3     4           True

My original data is shown in the 'data' column and the desired_output is shown next to it. If the number in 'data' is below 2.5, the desired_output is False; otherwise it is True.

I could apply a loop and reconstruct the DataFrame... but that would be 'un-pythonic'.

Merlin
nitin
  • maybe I don't know pandas, but it seems that you have *two* numbers in `data` -- which one are you checking against (seemingly the one on the right? What relevance is the number on the left?) – mgilson Feb 05 '13 at 18:26
  • the number on the left is the index and the one on the right is the data – nitin Feb 05 '13 at 18:31
  • Does this answer your question? [Pandas conditional creation of a series/dataframe column](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column) – AMC Jan 25 '20 at 19:14

5 Answers

75
In [1]: df
Out[1]:
   data
0     1
1     2
2     3
3     4

You want to apply a function that conditionally returns a value based on the selected dataframe column.

In [2]: df['data'].apply(lambda x: 'true' if x <= 2.5 else 'false')
Out[2]:
0     true
1     true
2    false
3    false
Name: data

You can then assign that returned column to a new column in your dataframe:

In [3]: df['desired_output'] = df['data'].apply(lambda x: 'true' if x <= 2.5 else 'false')

In [4]: df
Out[4]:
   data desired_output
0     1           true
1     2           true
2     3          false
3     4          false
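
To get the output the question asks for (False below 2.5, True otherwise), the same apply pattern can return real booleans instead of strings. A minimal sketch, assuming the four-row frame from the question; `above_threshold` is a hypothetical helper name, not something from the original answer:

```python
import pandas as pd

df = pd.DataFrame({"data": [1, 2, 3, 4]})

def above_threshold(x, threshold=2.5):
    # Hypothetical helper: True when the value exceeds the threshold.
    return x > threshold

# apply calls the function once per element of the column.
df["desired_output"] = df["data"].apply(above_threshold)
print(df)
```

A named function like this is easier to extend than a lambda if the rule later needs more than two outcomes.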
Zelazny7
  • Although this answer is more verbose and not as simple as the answer @Jasc gave, it is more general and can be applied to other situations in which one wants output other than true and false. – Jacques Mathieu Jun 20 '18 at 16:49
  • `apply` + `lambda` is not recommended for easily vectorisable operations. Use `np.where` or `loc` methods instead to utilize Pandas / NumPy vectorisation. – jpp Aug 10 '18 at 13:12
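
The `loc` approach mentioned in the comment above might look like the following. This is only a sketch: the labels "low" and "high" are made up for illustration and stand in for any non-boolean output.

```python
import pandas as pd

df = pd.DataFrame({"data": [1, 2, 3, 4]})

# Start from a default value for every row...
df["desired_output"] = "low"
# ...then overwrite only the rows where the boolean mask is True.
df.loc[df["data"] > 2.5, "desired_output"] = "high"
print(df)
```

The mask is evaluated once for the whole column, so no Python-level function is called per row.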
31

Just compare the column with that value:

In [9]: df = pandas.DataFrame([1,2,3,4], columns=["data"])

In [10]: df
Out[10]: 
   data
0     1
1     2
2     3
3     4

In [11]: df["desired"] = df["data"] > 2.5
In [12]: df
Out[12]: 
   data desired
0     1   False
1     2   False
2     3    True
3     4    True
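
Because the comparison yields a boolean Series, several conditions can be combined with `&`, `|` and `~`. A short sketch, assuming the same frame; the column names `in_range` and `not_above` and the 1.5 / 3.5 bounds are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"data": [1, 2, 3, 4]})

above = df["data"] > 2.5  # boolean Series, one entry per row

# Each comparison needs its own parentheses, because & binds
# more tightly than > and <.
in_range = (df["data"] > 1.5) & (df["data"] < 3.5)

df["desired"] = above
df["in_range"] = in_range
df["not_above"] = ~above  # element-wise negation
print(df)
```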
Jan Katins
17
In [34]: import pandas as pd

In [35]: import numpy as np

In [36]: df = pd.DataFrame([1,2,3,4], columns=["data"])

In [37]: df
Out[37]: 
   data
0     1
1     2
2     3
3     4

In [38]: df["desired_output"] = np.where(df["data"] < 2.5, "False", "True")

In [39]: df
Out[39]: 
   data desired_output
0     1          False
1     2          False
2     3           True
3     4           True
Surya
  • This is good, but the < seems unnecessarily confusing. If the condition is true, the first value results, if false the second value results. So it seems far more clear (and equivalent) to have the right side = np.where(df["data"] >= 2.5, "True", "False") – Wesley Kitlasten Oct 16 '18 at 14:48
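
One thing to watch with this answer: "True" and "False" are strings here, not booleans. A sketch of the difference, assuming the same frame; the variable names `as_text` and `as_bool` are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"data": [1, 2, 3, 4]})

# Condition first, then the value used where it holds, then the fallback.
as_text = np.where(df["data"] >= 2.5, "True", "False")  # array of strings
as_bool = np.where(df["data"] >= 2.5, True, False)      # array of booleans

# Any non-empty string is truthy, so the text "False" still passes an `if`.
text_false_is_truthy = bool("False")

df["desired_output"] = as_bool
print(df.dtypes)
```

If a boolean column is the goal, the `np.where` call is not strictly needed, since the bare comparison already produces one.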
14

In this specific example, where the DataFrame is only one column, you can write this elegantly as:

df['desired_output'] = df.le(2.5)

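le tests whether elements are less than or equal to 2.5; similarly, lt tests for less than, and gt and ge test for greater than and greater than or equal. Note that le(2.5) gives True for values below 2.5, which is the inverse of the desired_output in the question; gt(2.5) reproduces it.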
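
When the frame has more than one column, the same method can be called on the single column instead of the whole frame. A sketch, assuming a hypothetical extra column `other`, with `gt` chosen so that values above 2.5 map to True:

```python
import pandas as pd

df = pd.DataFrame({"data": [1, 2, 3, 4], "other": [10, 20, 30, 40]})

# Series.gt compares only the 'data' column and returns a boolean Series,
# so the 'other' column does not take part in the comparison.
df["desired_output"] = df["data"].gt(2.5)
print(df)
```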

Andy Hayden
0

You can also use eval here:

In [3]: df.eval('desired_output = data >= 2.5', inplace=True)

In [4]: df
Out[4]: 
   data  desired_output
0     1           False
1     2           False
2     3            True
3     4            True

Since inplace=True is set, you don't need to assign the result back to df.
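
For comparison, a sketch of the same call without inplace=True, where eval returns a new DataFrame and the original stays as it was. The name `result` is just a placeholder:

```python
import pandas as pd

df = pd.DataFrame({"data": [1, 2, 3, 4]})

# Without inplace=True, eval returns a copy that carries the new column.
result = df.eval("desired_output = data >= 2.5")

print(result)
print(df.columns.tolist())  # the original frame has no new column
```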

rachwa