201

Assume I have a pandas DataFrame with two columns, A and B. I'd like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

Arthur B.
  • 3,445
  • 3
  • 21
  • 24
  • If you are looking for a very fast solution use NumPy's `where` as seen in [this solution below](https://stackoverflow.com/questions/12307099/modifying-a-subset-of-rows-in-a-pandas-dataframe/47086085) – Ted Petrou Nov 02 '17 at 23:19

7 Answers

317

Use .loc for label based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean Series that selects the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to the documentation, this behavior depends on the underlying NumPy behavior. I've found that accessing everything in one operation (rather than in two chained steps like df[one][two]) is more likely to work for setting.
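
To make the failure mode concrete, here is a minimal sketch (toy DataFrame assumed) contrasting chained indexing with a single .loc call:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0], 'B': [2.0, 0.0, 5.0]})

# Chained indexing: df[df.A == 0] may be a copy, so the assignment
# can land on that copy and leave df untouched (pandas may emit a
# SettingWithCopyWarning here).
df[df.A == 0]['B'] = np.nan

# One .loc call: row mask and column label in a single operation,
# so the assignment is applied to the original DataFrame.
df.loc[df.A == 0, 'B'] = np.nan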

BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • The second part of this is a nice answer to a question that wasn't even asked ;-) I am wondering if this is still the canonical pandas answer, in particular b/c it's an obvious DRY violation, though I assume it is in fact necessary to violate DRY given the constraints of pandas internals? (I may post exactly this sort of question, in more detail, but wanted to see if you had a quick answer before I did so) – JohnE Mar 12 '19 at 17:20
  • How to subset a DataFrame that doesn't have column names? How to subset df just by index? df.loc[df[0]==0] doesn't work ... What's the alternative? Thank you – amipro Apr 10 '19 at 09:13
  • Just one hint: if you want to assign another DF, e.g. df.loc[df.A==0, df2], be aware that if df2 is continuously indexed from 0 to the end, it will do an index-wise assignment. So if the filtered **df** has index [1, 4, 5] and **df2** has [1, 2, 3], just index 1 will be assigned. This can be prevented by using an unindexed structure on the right-hand side, e.g. df.loc[df.A==0, np.array(df2)]; then it will be assigned pairwise. This is also the case for the pandas .mask() and .where() functions! – Berger Aug 09 '21 at 13:00
110

Here is the relevant part of the pandas docs on advanced indexing:

The section explains exactly what you need! It turns out that df.loc (.ix has been deprecated, as many have pointed out below) can be used for cool slicing and dicing of a DataFrame. And it can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying 'find me all the places where df.A == 0, select column B and set it to np.nan'
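
For instance (an illustrative sketch with a toy frame, not from the original answer), the same pattern sets several columns at once:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0], 'B': [2.0, 0.0, 5.0], 'C': [7.0, 8.0, 9.0]})

# selection criteria: rows where A == 0; columns I want: B and C
df.loc[df.A == 0, ['B', 'C']] = np.nan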

badgley
  • 1,657
  • 1
  • 13
  • 19
  • 3
    Yeah, somehow `loc[selection criteria, columns I want]` perfectly sticks to your mind... – EmEs Jan 10 '19 at 10:51
33

Starting from pandas 0.20, .ix is deprecated. The right way is to use df.loc.

Here is a working example:

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN
>>> 

Explanation:

As explained in the docs, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
  • Exploiting the fact that loc is also label based to select the column using the label 'B' in the column_index

We can use any logical operation, condition, or expression that returns a Series of booleans to construct the boolean array. In the above example, we want the rows that contain a 0 in column A; for that we can use df.A == 0. As you can see in the example below, this returns a Series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

For more information, check the advanced indexing documentation.

Mohamed Ali JAMAOUI
  • 14,275
  • 14
  • 73
  • 117
19

For a massive speed increase, use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy's where is about 4x faster.
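
As an aside (my note, prompted by the comments below): passing Series instead of raw arrays to np.where works too; .values merely avoids some pandas overhead. A quick equivalence sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 3, (100000, 2)), columns=list('ab'))

# np.where returns a plain NumPy array whether it is fed Series or
# raw arrays; .values just skips pandas' alignment overhead.
out_arrays = np.where(df.a.values == 0, np.nan, df.b.values)
out_series = np.where(df.a == 0, np.nan, df.b)

assert np.array_equal(out_arrays, out_series, equal_nan=True)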

Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • I was curious about this so I tested it myself, and the difference was even greater with other parameters. NumPy was almost 10 times faster at replacing 0s with an integer instead of np.nan. I wonder what takes the extra time. – Alexander Aug 21 '18 at 13:30
  • Is it necessary to use `.values` in `np.where(df.a.values == 0, np.nan, df.b.values)`? Looks like `np.where(df.a == 0, np.nan, df.b)` also works? – hsl Feb 02 '20 at 23:17
6

To replace multiple columns, convert to a NumPy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
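
A self-contained version of the same idea (toy frame assumed; .values strips the index so the right-hand side is assigned positionally rather than aligned by index):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [2.0, 4.0, 6.0],
                   'C': [10.0, 20.0, 30.0]})

# Halve B and C in the rows where A is 0; .values hands over a bare
# NumPy array, so no index alignment happens on assignment.
df.loc[df.A == 0, ['B', 'C']] = df.loc[df.A == 0, ['B', 'C']].values / 2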
Adrien Renaud
  • 2,439
  • 18
  • 22
2

To modify a DataFrame in Pandas, you can use "syntactic sugar" operators like +=, *=, /= etc. So instead of:

df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2

You can write:

df.loc[df.A == 0, 'B'] /= 2

To replace values with NaN you can use Pandas methods mask or where. For example:

df  = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

   A  B
0  1  0
1  2  0
2  3  4

df['A'].mask(df['B'] == 0, inplace=True) # other=np.nan by default
# df['A'].where(df['B'] != 0, inplace=True) 

Result:

     A  B
0  NaN  0
1  NaN  0
2  3.0  4
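
One caveat (my note, not part of the original answer): inplace=True on a column pulled out of the DataFrame relies on that Series sharing data with the frame, which newer pandas (copy-on-write) no longer guarantees. The plain assignment form is equivalent and safer:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

# Assignment form of the same mask: A becomes NaN where B == 0
# (other=np.nan is the default for mask).
df['A'] = df['A'].mask(df['B'] == 0)
# df['A'] = df['A'].where(df['B'] != 0)   # equivalent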
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
1

Alternatives:

No. 1 looks best to me, but oddly I can't find the supporting documentation for it. (A combined sanity check of these options follows the list.)

  1. Filter the column as a series (note: the filter comes after the column being written to, not before)

dataframe.column[filter condition] = value to change to

df.B[df.A==0] = np.nan

  2. loc

dataframe.loc[filter condition, column to change] = value to change to

df.loc[df.A == 0, 'B'] = np.nan

  3. numpy where

dataframe.column = np.where(filter condition, value if true, value if false)

import numpy as np
df.B = np.where(df.A == 0, np.nan, df.B)

  4. apply lambda

dataframe.column = df.apply(lambda row: value if condition is true else other value, axis=1)   (axis=1 applies the function row by row)

df.B = df.apply(lambda x: np.nan if x['A']==0 else x['B'], axis=1)

  5. zip and list comprehension

dataframe.column = [value if condition is true else other value for a, b in zip(column a, column b)]

df.B = [np.nan if a==0 else b for a,b in zip(df.A, df.B)]
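
As a quick sanity check (my sketch, not part of the original answer), options 2-5 can be run on a toy frame and compared; option 1 is chained assignment, which may not propagate under copy-on-write pandas:

import numpy as np
import pandas as pd

base = pd.DataFrame({'A': [0, 1, 0], 'B': [2.0, 0.0, 5.0]})
expected = pd.DataFrame({'A': [0, 1, 0], 'B': [np.nan, 0.0, np.nan]})

df = base.copy()
df.loc[df.A == 0, 'B'] = np.nan                                          # no. 2
pd.testing.assert_frame_equal(df, expected)

df = base.copy()
df['B'] = np.where(df.A == 0, np.nan, df.B)                              # no. 3
pd.testing.assert_frame_equal(df, expected)

df = base.copy()
df['B'] = df.apply(lambda x: np.nan if x['A'] == 0 else x['B'], axis=1)  # no. 4
pd.testing.assert_frame_equal(df, expected)

df = base.copy()
df['B'] = [np.nan if a == 0 else b for a, b in zip(df.A, df.B)]          # no. 5
pd.testing.assert_frame_equal(df, expected)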
Johnny V
  • 1,108
  • 14
  • 21