279

I have a simple DataFrame like the following:

                  Team  First Season  Total Games
0       Dallas Cowboys          1960          894
1        Chicago Bears          1920         1357
2    Green Bay Packers          1921         1339
3       Miami Dolphins          1966          792
4     Baltimore Ravens          1996          326
5  San Francisco 49ers          1950         1003

I want to select all values from the First Season column and replace those that are greater than 1990 with 1. In this example, only the Baltimore Ravens row would have its 1996 replaced with 1 (keeping the rest of the data intact).

I have used the following:

df.loc[(df['First Season'] > 1990)] = 1

But it replaces all the values in the matching row with 1, not just the value in the 'First Season' column.

How can I replace just the values from that column?
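
For reference, a minimal snippet to rebuild this sample DataFrame (plain pandas, values copied from the table above):

import pandas as pd

df = pd.DataFrame({
    'Team': ['Dallas Cowboys', 'Chicago Bears', 'Green Bay Packers',
             'Miami Dolphins', 'Baltimore Ravens', 'San Francisco 49ers'],
    'First Season': [1960, 1920, 1921, 1966, 1996, 1950],
    'Total Games': [894, 1357, 1339, 792, 326, 1003],
})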

tdy
ichimok

9 Answers

437

You need to select that column:

In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df

Out[41]:
                  Team  First Season  Total Games
0       Dallas Cowboys          1960          894
1        Chicago Bears          1920         1357
2    Green Bay Packers          1921         1339
3       Miami Dolphins          1966          792
4     Baltimore Ravens             1          326
5  San Francisco 49ers          1950         1003

So the syntax here is:

df.loc[<boolean mask>, <optional column(s)>]

Here the mask generates the row labels to index, and the optional second argument restricts the assignment to the given column(s). You can check the docs and also the 10 Minutes to pandas guide, which shows the semantics.
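
A small sketch of the same pattern with the optional column part spelled out (the multi-column line is only an illustration, not something OP asked for):

mask = df['First Season'] > 1990                      # boolean Series aligned to df's index
df.loc[mask, 'First Season'] = 1                      # assign only in that column, where mask is True
df.loc[mask, ['First Season', 'Total Games']] = 1     # the same idea for several columns at once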

EDIT

If you want to generate a boolean indicator instead, you can just use the boolean condition to generate a boolean Series and cast the dtype to int; this will convert True and False to 1 and 0 respectively:

In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df

Out[43]:
                  Team  First Season  Total Games
0       Dallas Cowboys             0          894
1        Chicago Bears             0         1357
2    Green Bay Packers             0         1339
3       Miami Dolphins             0          792
4     Baltimore Ravens             1          326
5  San Francisco 49ers             0         1003
EdChum
89

A bit late to the party but still - I prefer using numpy where:

import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
Amir F
    I was looking for a solution for overwriting column values conditionally, but based on an other column's value, like this: df['col1'] = np.where(df['id'] == '318431682259014', 'NEW', df['col1']) This was the solution for it. – user582175 Nov 06 '19 at 12:11
    I am trying to do this for multiple conditions like this, but I keep getting `ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()`. What I'm trying to do is basically `df['A'] = np.where(df['B'] in some_values, df['A']*2, df['A'])`. Does anybody have an idea on this? – riskypenguin Feb 03 '20 at 19:14
    There is a built-in `where` method in pandas now, and it is compared to `np.where` in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html#pandas.DataFrame.where – Edward Feb 09 '21 at 09:22
  • I was playing around with this df and found that if you changed the code to ...nfl_df['First Season'] = np.where(nfl_df['First Season'] > 1990, 1, nfl_df['Total Games']) then it replaces all values in First Season with values in Total Games, not just ones over 1990. Why would it do this? It doesn't seem very logical. – chucknor Nov 06 '21 at 15:57
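
Regarding the multiple-condition comment above: Python's `in` does not broadcast over a Series (that is what raises the ambiguous truth value error), so the membership test needs Series.isin. A minimal sketch under that assumption, where `some_values` is just a placeholder list:

import numpy as np

some_values = ['x', 'y']  # placeholder list of allowed values
df['A'] = np.where(df['B'].isin(some_values), df['A'] * 2, df['A'])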
21
df.loc[df['First Season'] > 1990, 'First Season'] = 1

Explanation:

df.loc takes two arguments, a row indexer and a column indexer. Here we check, for each row, whether the value in the 'First Season' column is greater than 1990, and replace it with 1 where it is.

MoD
Abdullah shafi
17
df['First Season'].loc[(df['First Season'] > 1990)] = 1

Strange that nobody has this answer; the only change your code needs is the ['First Season'] right after df, and the extra parentheses inside .loc are redundant and can be dropped. Note that this is chained assignment: it modifies df in older pandas, but not under copy-on-write, so the .loc form above is the safer choice.

Odz
6

For a single condition, i.e. (df['employrate'] > 70)

       country        employrate alcconsumption
0  Afghanistan  55.7000007629394            .03
1      Albania  51.4000015258789           7.29
2      Algeria              50.5            .69
3      Andorra                            10.17
4       Angola  75.6999969482422           5.57

use this:

df.loc[df['employrate'] > 70, 'employrate'] = 7

       country  employrate alcconsumption
0  Afghanistan   55.700001            .03
1      Albania   51.400002           7.29
2      Algeria   50.500000            .69
3      Andorra         nan          10.17
4       Angola    7.000000           5.57

Therefore, the syntax here is:

df.loc[<boolean mask>, <optional column(s)>]

For multiple conditions ie. (df['employrate'] <=55) & (df['employrate'] > 50)

use this:

import numpy as np

df['employrate'] = np.where(
    (df['employrate'] <= 55) & (df['employrate'] > 50), 11, df['employrate']
)

out[108]:
       country  employrate alcconsumption
0  Afghanistan   55.700001            .03
1      Albania   11.000000           7.29
2      Algeria   11.000000            .69
3      Andorra         nan          10.17
4       Angola   75.699997           5.57

Therefore, the syntax here is:

df['<column_name>'] = np.where((<filter 1>) & (<filter 2>), <new value>, df['<column_name>'])
Harshit Jain
5

TLDR — Here is some usage guidance, including some methods that haven't been mentioned yet:

Use case             Recommended     Example
Speed                DataFrame.loc   df.loc[df['A'] < 10, 'A'] = 1
Method chaining      Series.mask     df['A'] = df['A'].mask(df['A'] < 10, 1).method1().method2()
Whole dataframe      DataFrame.mask  df = df.mask(df['A'] < 10, df**2)
Multiple conditions  np.select       df['A'] = np.select([df['A'] < 10, df['A'] > 20], [1, 2], default=df['A'])

1. Speed

Use DataFrame.loc if you have a large dataframe and are concerned about speed:

df.loc[df['Season'] > 1990, 'Season'] = 1

For small dataframes, speed is trivial, but technically there are faster options if you want:

n = 1_000_000  # number of rows; any large value will do for timing
df = pd.DataFrame({'Team': np.random.choice([*'ABCDEFG'], size=n), 'Season': np.random.randint(1900, 2001, size=n), 'Games': np.random.randint(0, 17, size=n)})

2. Method chaining

Use a Series method if you want to conditionally replace values within a method chain:

  • Series.mask replaces values where the given condition is true

    df['Season'] = df['Season'].mask(df['Season'] > 1990, 1)
    
  • Series.where is just the inverted version (replace when false)

    df['Season'] = df['Season'].where(df['Season'] <= 1990, 1)
    

The chaining benefit is not obvious in OP's example but is very useful in other situations. Just as a toy example:

# compute average games per team, but pre-1972 games are weighted by half
df['Games'].mask(df['Season'] < 1972, 0.5*df['Games']).groupby(df['Team']).mean()


3. Whole dataframe

Use DataFrame.mask if you want to conditionally replace values throughout the whole dataframe.

It's not easy to come up with a meaningful example given OP's sample, but here is a trivial example for demonstration:

# replace the given elements with the doubled value (or repeated string)
df.mask(df.isin(['Chicago Bears', 'Buffalo Bills', 8, 1990]), 2*df)


4. Multiple conditions

Use np.select if you have multiple conditions, each with a different replacement:

# replace pre-1920 seasons with 0 and post-1990 seasons with 1
conditions = {
    0: df['Season'] < 1920,
    1: df['Season'] > 1990,
}
df['Season'] = np.select(conditions.values(), conditions.keys(), default=df['Season'])


tdy
4

Another option is to use a list comprehension:

df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]
Henry
    This is the best option when you need to work directly on the value instead of using a constant value. – Baguette Aug 03 '21 at 14:56
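
To illustrate the comment above, a hypothetical variation that transforms the matching values instead of writing a constant (the -1900 offset is made up purely for demonstration):

df['First Season'] = [year - 1900 if year > 1990 else year for year in df['First Season']]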
1

You can also use mask which replaces the values where the condition is met:

df['First Season'] = df['First Season'].mask(lambda col: col > 1990, 1)
rachwa
0

We can update the First Season column in df with the following syntax:

df['First Season'] = expression_for_new_values

To map the values in First Season we can use pandas' .map() method with the below syntax:

data_frame['column'].map({'initial_value_1': 'updated_value_1', 'initial_value_2': 'updated_value_2'})
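
Applied to this question, a .map-based version would have to list the concrete values to rewrite and then restore the unmapped ones, e.g. (a sketch, assuming the sample data above):

df['First Season'] = df['First Season'].map({1996: 1}).fillna(df['First Season']).astype(int)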
Pobaranchuk