201

Assume I have a pandas DataFrame with two columns, A and B. I'd like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

Arthur B.
  • 3,445
  • 3
  • 21
  • 24
  • If you are looking for a very fast solution use NumPy's `where` as seen in [this solution below](https://stackoverflow.com/questions/12307099/modifying-a-subset-of-rows-in-a-pandas-dataframe/47086085) – Ted Petrou Nov 02 '17 at 23:19

7 Answers

317

Use .loc for label based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean Series that selects the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to the documentation, this behavior depends on the underlying NumPy behavior. I've found that accessing everything in one operation (rather than in two chained steps like df[one][two]) is more likely to work for setting.
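
To make the failure mode concrete, here is a minimal sketch (toy DataFrame assumed) contrasting chained indexing with a single .loc call:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0], 'B': [2.0, 0.0, 5.0]})

# Chained indexing: df[df.A == 0] may be a copy, so the assignment
# can land on that copy and leave df untouched (pandas may emit a
# SettingWithCopyWarning here).
df[df.A == 0]['B'] = np.nan

# One .loc call: row mask and column label in a single operation,
# so the assignment is applied to the original DataFrame.
df.loc[df.A == 0, 'B'] = np.nan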

BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • The second part of this is a nice answer to a question that wasn't even asked ;-) I am wondering if this is still the canonical pandas answer, in particular b/c it's an obvious DRY violation, though I assume it is in fact necessary to violate DRY given the constraints of pandas internals? (I may post exactly this sort of question, in more detail, but wanted to see if you had a quick answer before I did so) – JohnE Mar 12 '19 at 17:20
  • How to subset a DataFrame that doesn't have column names? How to subset df just by index? df.loc[df[0]==0] doesn't work ... What's the alternative? Thank you – amipro Apr 10 '19 at 09:13
  • Just one hint: if you want to assign another DF, e.g. df.loc[df.A==0, df2], be aware that if df2 is continuously indexed from 0 to the end, it will do an index-wise assignment. So if the filtered **df** has index [1, 4, 5] and **df2** has [1, 2, 3], just index 1 will be assigned. This can be prevented by using an unindexed structure on the right-hand side, e.g. df.loc[df.A==0, np.array(df2)]; then it will be assigned pairwise. This is also the case for the pandas .mask() and .where() functions! – Berger Aug 09 '21 at 13:00
110

Here is the relevant part of the pandas docs on advanced indexing:

The section explains exactly what you need! It turns out that df.loc (.ix has been deprecated, as many have pointed out below) can be used for cool slicing and dicing of a DataFrame. And it can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying 'find me all the places where df.A == 0, select column B and set it to np.nan'
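
For instance (an illustrative sketch with a toy frame, not from the original answer), the same pattern sets several columns at once:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0], 'B': [2.0, 0.0, 5.0], 'C': [7.0, 8.0, 9.0]})

# selection criteria: rows where A == 0; columns I want: B and C
df.loc[df.A == 0, ['B', 'C']] = np.nan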

badgley
  • 1,657
  • 1
  • 13
  • 19
  • 3
    Yeah, somehow `loc[selection criteria, columns I want]` perfectly sticks to your mind... – EmEs Jan 10 '19 at 10:51
33

Starting from pandas 0.20, .ix is deprecated. The right way is to use df.loc.

Here is a working example:

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN
>>> 

Explanation:

As explained in the docs, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
  • Exploiting the fact that loc is also label based to select the column using the label 'B' in the column_index

We can use any logical operation, condition, or expression that returns a Series of booleans to construct the boolean array. In the above example, we want the rows that contain a 0 in column A; for that we can use df.A == 0. As you can see in the example below, this returns a Series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

For more information, check the advanced indexing documentation.

Mohamed Ali JAMAOUI
  • 14,275
  • 14
  • 73
  • 117
19

For a massive speed increase, use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy's where is about 4x faster.
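
As an aside (my note, prompted by the comments below): passing Series instead of raw arrays to np.where works too; .values merely avoids some pandas overhead. A quick equivalence sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 3, (100000, 2)), columns=list('ab'))

# np.where returns a plain NumPy array whether it is fed Series or
# raw arrays; .values just skips pandas' alignment overhead.
out_arrays = np.where(df.a.values == 0, np.nan, df.b.values)
out_series = np.where(df.a == 0, np.nan, df.b)

assert np.array_equal(out_arrays, out_series, equal_nan=True)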

Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • I was curious about this so I tested it myself, and the difference was even greater with other parameters. NumPy was almost 10 times faster at replacing 0s with an integer instead of np.nan. I wonder what takes the extra time. – Alexander Aug 21 '18 at 13:30
  • Is it necessary to use `.values` in `np.where(df.a.values == 0, np.nan, df.b.values)`? Looks like `np.where(df.a == 0, np.nan, df.b)` also works? – hsl Feb 02 '20 at 23:17
6

To replace multiple columns, convert to a NumPy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
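
A self-contained version of the same idea (toy frame assumed; .values strips the index so the right-hand side is assigned positionally rather than aligned by index):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [2.0, 4.0, 6.0],
                   'C': [10.0, 20.0, 30.0]})

# Halve B and C in the rows where A is 0; .values hands over a bare
# NumPy array, so no index alignment happens on assignment.
df.loc[df.A == 0, ['B', 'C']] = df.loc[df.A == 0, ['B', 'C']].values / 2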
Adrien Renaud
  • 2,439
  • 18
  • 22
2

To modify a DataFrame in Pandas, you can use "syntactic sugar" operators like +=, *=, /= etc. So instead of:

df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2

You can write:

df.loc[df.A == 0, 'B'] /= 2

To replace values with NaN you can use Pandas methods mask or where. For example:

df  = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

   A  B
0  1  0
1  2  0
2  3  4

df['A'].mask(df['B'] == 0, inplace=True) # other=np.nan by default
# df['A'].where(df['B'] != 0, inplace=True) 

Result:

     A  B
0  NaN  0
1  NaN  0
2  3.0  4
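
One caveat (my note, not part of the original answer): inplace=True on a column pulled out of the DataFrame relies on that Series sharing data with the frame, which newer pandas (copy-on-write) no longer guarantees. The plain assignment form is equivalent and safer:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

# Assignment form of the same mask: A becomes NaN where B == 0
# (other=np.nan is the default for mask).
df['A'] = df['A'].mask(df['B'] == 0)
# df['A'] = df['A'].where(df['B'] != 0)   # equivalent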
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
1

Alternatives:

No. 1 looks best to me, but oddly I can't find the supporting documentation for it. (A combined sanity check of these options follows the list.)

  1. Filter the column as a series (note: the filter comes after the column being written to, not before)

dataframe.column[filter condition] = value to change to

df.B[df.A==0] = np.nan

  2. loc

dataframe.loc[filter condition, column to change] = value to change to

df.loc[df.A == 0, 'B'] = np.nan

  3. numpy where

dataframe.column = np.where(filter condition, value if true, value if false)

import numpy as np
df.B = np.where(df.A == 0, np.nan, df.B)

  4. apply lambda

dataframe.column = df.apply(lambda row: value if condition is true else other value, axis=1)   (axis=1 applies the function row by row)

df.B = df.apply(lambda x: np.nan if x['A']==0 else x['B'], axis=1)

  5. zip and list comprehension

dataframe.column = [value if condition is true else other value for a, b in zip(column a, column b)]

df.B = [np.nan if a==0 else b for a,b in zip(df.A, df.B)]
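
As a quick sanity check (my sketch, not part of the original answer), options 2-5 can be run on a toy frame and compared; option 1 is chained assignment, which may not propagate under copy-on-write pandas:

import numpy as np
import pandas as pd

base = pd.DataFrame({'A': [0, 1, 0], 'B': [2.0, 0.0, 5.0]})
expected = pd.DataFrame({'A': [0, 1, 0], 'B': [np.nan, 0.0, np.nan]})

df = base.copy()
df.loc[df.A == 0, 'B'] = np.nan                                          # no. 2
pd.testing.assert_frame_equal(df, expected)

df = base.copy()
df['B'] = np.where(df.A == 0, np.nan, df.B)                              # no. 3
pd.testing.assert_frame_equal(df, expected)

df = base.copy()
df['B'] = df.apply(lambda x: np.nan if x['A'] == 0 else x['B'], axis=1)  # no. 4
pd.testing.assert_frame_equal(df, expected)

df = base.copy()
df['B'] = [np.nan if a == 0 else b for a, b in zip(df.A, df.B)]          # no. 5
pd.testing.assert_frame_equal(df, expected)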
Johnny V
  • 1,108
  • 14
  • 21