
I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.

Initially I tried a for-loop over each value of the dataframe, which took too much time.

Then I used data_new=data.subtract(data), which was meant to subtract all the values of the dataframe from itself so that I could make all the non-null values 0. But an error occurred because the dataframe had multiple string entries.

jezrael
Anirban De
  • Possible duplicate of [How can I replace all the NaN values with Zero's in a column of a pandas dataframe](http://stackoverflow.com/questions/13295735/how-can-i-replace-all-the-nan-values-with-zeros-in-a-column-of-a-pandas-datafra) – Ani Menon May 31 '16 at 10:57

10 Answers


You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise, and cast it to an integer, giving you 0 where the DataFrame is NaN and 1 otherwise:

newdf = df.notnull().astype('int')

If you really want to write into your original DataFrame, this will work:

df.loc[~df.isnull()] = 1  # not nan
df.loc[df.isnull()] = 0   # nan
fmarc
  • Sorry, please don't copy my answer. I think your version is the same as mine, so it isn't necessary. – jezrael May 31 '16 at 13:10
    Did you notice that I posted this answer before you? – fmarc May 31 '16 at 13:12
  • Yes, but not with `notnull()`. Your answer used `~isnull()`. They are equivalent, so your solution is as good as mine. – jezrael May 31 '16 at 13:13
  • You're right that both our answers achieve the same in pretty much the same way. I tried to improve my answer by removing an unneccessary double inversion. – fmarc May 31 '16 at 13:19
  • Yes, I understand. But if some answer is better, I think it is bad to copy it; you can upvote it instead ;) I have done that only twice on SO, because I worry about being low in the ordering :( – jezrael May 31 '16 at 13:22

Use notnull and cast the boolean result to int with astype:

print(df.notnull().astype('int'))

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print(df)
     a    b
0  NaN  1.0
1  4.0  NaN
2  NaN  3.0

print(df.notnull())
       a      b
0  False   True
1   True  False
2  False   True

print(df.notnull().astype('int'))
   a  b
0  0  1
1  1  0
2  0  1
jezrael

I'd advise making a new column rather than just replacing in place. You can always delete the previous column if necessary, but it's always helpful to keep the source for a column populated via an operation on another.

e.g. if df['col1'] is the existing column

df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else np.nan)

where col2 is the new column. This should also work if col1 contains string entries.
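A minimal runnable sketch of this approach on toy data (the column names follow the answer above; note the lambda keeps NaN as NaN rather than turning it into 0 — swap np.nan for 0 to match the question exactly):

```python
import numpy as np
import pandas as pd

# toy frame with a NaN, a number, and a string entry
df = pd.DataFrame({'col1': [np.nan, 4.0, 'text']})

# non-NaN entries become 1; NaN entries stay NaN, per the lambda above
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else np.nan)
```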

tompiler

I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replacing by indexing). NOTE: this is not an answer to the OP's question; rather, it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer, I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.

I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.

from __future__ import division, print_function

import numpy as np
import pandas as pd
import datetime as dt


# create a dataframe with randomly placed NaNs
# (integer shape/size arguments are required on modern NumPy)
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan

df = pd.DataFrame(data=data)

trials = np.arange(100)


d1 = dt.datetime.now()

for r in trials:
    new_df = df.notnull().astype(int)

print( (dt.datetime.now()-d1).total_seconds()/trials.size )


# create a dummy copy of df.  I use a dummy copy here to prevent biasing the 
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()

d1 = dt.datetime.now()

for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull()==False] = 1

print( (dt.datetime.now()-d1).total_seconds()/trials.size )

This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.

tnknepp

There is a method .fillna() on DataFrames which does what you need. For example:

df = df.fillna(0)  # Replace all NaN values with zero, returning the modified DataFrame

or

df.fillna(0, inplace=True)   # Replace all NaN values with zero, updating the DataFrame directly
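A quick sketch (toy data) of the difference between the two calls — and a reminder that fillna only covers the NaN-to-0 half of the question, leaving non-NaN values untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4.0], 'b': [1.0, np.nan]})

filled = df.fillna(0)        # returns a new DataFrame; df itself is unchanged
df.fillna(0, inplace=True)   # modifies df directly

# non-NaN values (4.0 and 1.0 here) are NOT converted to 1 by either call
```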
mfitzp
DainDwarf
  • This only partially answers the question, the question is about replacing both NaN and non-NaN. – cglacet Aug 10 '21 at 12:35

Regarding fmarc's answer:

df.loc[~df.isnull()] = 1  # not nan
df.loc[df.isnull()] = 0   # nan

The code above does not work for me; the following does:

df[~df.isnull()] = 1  # not nan
df[df.isnull()] = 0   # nan

(Tested with pandas 0.25.3.)

And if you want to change values only in specific columns, you may need to create a temp dataframe and assign it back to the columns of the original dataframe:

change_col = ['a', 'b']
tmp = df[change_col].copy()  # copy to avoid a SettingWithCopyWarning
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
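A runnable sketch of the plain-indexing variant on toy data (the order matters: set the non-NaN entries first, while the NaNs are still distinguishable):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4.0], 'b': [1.0, np.nan]})

df[~df.isnull()] = 1  # non-NaN -> 1 (done first, while NaNs are still NaN)
df[df.isnull()] = 0   # NaN -> 0
```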
Xin Niu

Try this one:

df.notnull().mul(1)
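Multiplying a boolean frame by 1 casts it to integers, so this is equivalent to astype(int); a minimal check on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4.0, 7.5]})
out = df.notnull().mul(1)  # True * 1 -> 1, False * 1 -> 0
```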
Henry Ecker
Alpha

Here is a suggestion for handling a particular column: if a row in that column is NaN, replace it with 0; if it holds a value, replace it with 1.

The line below replaces the NaN values in the column with 0:

df.YourColumnName.fillna(0,inplace=True)

The remaining non-NaN values are then replaced with 1 by the code below:

df["YourColumnName"]=df["YourColumnName"].apply(lambda x: 1 if x!=0 else 0)

The same can be applied to the whole DataFrame by not specifying a column name.
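A sketch of the two steps on a toy column ('YourColumnName' is a placeholder; the inplace chained assignment is avoided here to stay safe on newer pandas). Note the caveat raised in the comments: values that were genuinely 0 to begin with also end up as 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'YourColumnName': [np.nan, 4.0, 7.5]})

# step 1: NaN -> 0
df["YourColumnName"] = df["YourColumnName"].fillna(0)
# step 2: anything non-zero -> 1
df["YourColumnName"] = df["YourColumnName"].apply(lambda x: 1 if x != 0 else 0)
```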

  • What about values at 0 at the beginning? – B. Go Jun 17 '19 at 19:17
  • @B.Go that would change the NaN values to 0 before the zero check on the second line. I tried checking for 'NaN' directly on the second line, but that was an error; to check for NaN like that you need NumPy. So this was the easy way for me. – arshad anzar Jun 18 '19 at 16:25

Use df.fillna(0) to fill NaN values with 0.

afuc func

Generally there are two steps: substitute all non-NaN values, then substitute all NaN values.

  1. dataframe.where(~dataframe.notna(), 1) - this line replaces all non-NaN values with 1.
  2. dataframe.fillna(0) - this line replaces all NaNs with 0.

Side note: per the pandas documentation, .where replaces the values where the condition is False - that is the important point here. That is why we invert to create the mask ~dataframe.notna() (equivalent to dataframe.isna()), which tells .where() which values to replace.
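The two steps chained on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4.0], 'b': [1.0, np.nan]})

# keep values only where the mask is True (i.e. the NaNs); replace the rest with 1
step1 = df.where(~df.notna(), 1)
# then fill the surviving NaNs with 0
result = step1.fillna(0)
```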

Andrew