Replace invalid values with None in Pandas DataFrame

Question

Is there any method to replace values with None in Pandas in Python?

You can use df.replace('pre', 'post') and can replace a value with another, but this can't be done if you want to replace with None value, which if you try, you get a strange result.

So here's an example:

df = DataFrame(['-',3,2,5,1,-5,-1,'-',9])
df.replace('-', 0)

which returns a successful result.

But,

df.replace('-', None)

which returns a following result:

0
0   - // this isn't replaced
1   3
2   2
3   5
4   1
5  -5
6  -1
7  -1 // this is changed to `-1`...
8   9

Why does such a strange result be returned?

Since I want to pour this data frame into MySQL database, I can't put NaN values into any element in my data frame and instead want to put None. Surely, you can first change '-' to NaN and then convert NaN to None, but I want to know why the dataframe acts in such a terrible way.

Tested on pandas 0.12.0 dev on Python 2.7 and OS X 10.8. Python is a pre-installed version on OS X and I installed pandas by using SciPy Superpack script, for your information.

Yup. You encounter `InternalError: (1054, u"Unknown column 'nan' in 'field list'")` error. I don't know about any solutions on it other than converting `NaN` to `None` before executing `write_frame` method. — Blaszard, Jun 13 '13 at 21:40
Scipy super pack gives out dev? Ok, well I definitely think you should [raise this as an issue on github](https://github.com/pydata/pandas/issues), shouldn't be too hard to fix. — Andy Hayden, Jun 13 '13 at 21:49
**If you are reading this data from CSV/Excel**, you can read these values in as NaN using `na_values` argument. [More information in this answer.](https://stackoverflow.com/a/55469393/4909087) — cs95, Apr 29 '19 at 10:11
In case anyone is here primarily due to the second questionable part of the replace (where a value of `-1` is introduced on record 7), this is due to a default value of `'pad'` as the `method` parameter. If a maintainer of pandas happens to be reading this, that default behavior is _NOT_ desirable in virtually any environment I've experienced and easily results in data corruption! — bsplosion, Aug 24 '20 at 18:28

score 141 · Accepted Answer · answered Jun 13 '13 at 21:29

141

Actually in later versions of pandas this will give a TypeError:

df.replace('-', None)
TypeError: If "to_replace" and "value" are both None then regex must be a mapping

You can do it by passing either a list or a dictionary:

In [11]: df.replace('-', df.replace(['-'], [None]) # or .replace('-', {0: None})
Out[11]:
      0
0  None
1     3
2     2
3     5
4     1
5    -5
6    -1
7  None
8     9

But I recommend using NaNs rather than None:

In [12]: df.replace('-', np.nan)
Out[12]:
     0
0  NaN
1    3
2    2
3    5
4    1
5   -5
6   -1
7  NaN
8    9

answered Jun 13 '13 at 21:29

Andy Hayden

359,921
101
625
535

18

Or simply a list, e.g. `df.replace(['-'], [None])`, or `df.replace({'-': None})`, I think. The use of `None` as a sentinel precludes using it as a value too.. – DSM Jun 13 '13 at 21:30
@user2360798 replace is actually a very feature-rich (read complicated) function, the [(dev)docstring](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html) is really good though. – Andy Hayden Jun 13 '13 at 21:40
4

I don't know if it's obvious, but had to assign `df` back to itself like: `df = df.replace({'?': np.nan})` – luckyging3r Sep 26 '16 at 00:22
3

@AndyHayden `df.replace('-', df.replace(['-'], [None])` looks funky, is that a typo? – lin_bug Mar 17 '17 at 11:11
@lin_bug I think should just be df.replace(['-'], [None]) :) good catch! – Andy Hayden Mar 17 '17 at 14:08
2

@lin_bug Though it seems the no longer works in recent pandas versions. df.where(df!='-', None) works – Andy Hayden Mar 17 '17 at 14:13
I don't know which version of pandas would return that error. I have pandas 0.23.4, and there was no TypeError for me. pandas displayeds the behavior explained by @Blaszard: the vlue is copied from the row before instead of being set to None – robertspierre Apr 17 '19 at 06:58
@raffamaiden `df.where(df!='-', None)` seems to work, perhaps the only one to work. Yes, this answer is rather old (nearly 6 years! I guess that deprecation/warning/TypeError was removed), definitely agree with cs95 that it should be parsed during read_csv. – Andy Hayden Apr 17 '19 at 22:08
@AndyHayden I agree with him to, but still, is it strange behavior of assigning the value of the previous row to None documented somewhere or is it a bug? – robertspierre Apr 18 '19 at 22:59
@raffamaiden it may be worth creating an issue, the problem is that None in python is used for all optional arguments so pandas can't tell whether it's .replace(x) vs .replace(x, None)... that said this is strange behavior either way – Andy Hayden Apr 18 '19 at 23:03

Michael Dorner · Answer 2 · 2018-10-25T20:40:29.567

36

I prefer the solution using replace with a dict because of its simplicity and elegance:

df.replace({'-': None})

You can also have more replacements:

df.replace({'-': None, 'None': None})

And even for larger replacements, it is always obvious and clear what is replaced by what - which is way harder for long lists, in my opinion.

edited Oct 25 '18 at 20:40

answered Mar 21 '18 at 12:31

Michael Dorner

17,587
13
87
117

2

It is worth noting that part of why this technique works is that usage of the `dict` type in `to_replace` causes the `method` parameter to not be evaluated, and thus the `method='pad'` default to have no ill effects. – bsplosion Aug 24 '20 at 18:29

score 18 · Answer 3 · answered Apr 09 '14 at 21:38

18

where is probably what you're looking for. So

data=data.where(data=='-', None)

From the panda docs:

where [returns] an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other).

answered Apr 09 '14 at 21:38

user2966041

527
6
9

7

This is actually inaccurate. data=data.where(data=='-', None) will replace anything that is NOT EQUAL to '-' with None. Pandas version of where keeps the value of the first arg(in this case data=='-'), and replace anything else with the second arg (in this case None). It is a bit confusing as np.where is more explicit in that it asks the conditional in the first arg, then the if true in the second arg, then the if false in the 3rd arg. – clg4 Jul 04 '17 at 23:10

cs95 · Answer 4 · 2019-12-08T09:25:32.010

Before proceeding with this post, it is important to understand the difference between NaN and None. One is a float type, the other is an object type. Pandas is better suited to working with scalar types as many methods on these types can be vectorised. Pandas does try to handle None and NaN consistently, but NumPy cannot.

My suggestion (and Andy's) is to stick with NaN.

But to answer your question...

pandas >= 0.18: Use `na_values=['-']` argument with `read_csv`

If you loaded this data from CSV/Excel, I have good news for you. You can quash this at the root during data loading instead of having to write a fix with code as a subsequent step.

Most of the pd.read_* functions (such as read_csv and read_excel) accept a na_values attribute.

file.csv

A,B
-,1
3,-
2,-
5,3
1,-2
-5,4
-1,-1
-,0
9,0

Now, to convert the - characters into NaNs, do,

import pandas as pd
df = pd.read_csv('file.csv', na_values=['-'])
df

     A    B
0  NaN  1.0
1  3.0  NaN
2  2.0  NaN
3  5.0  3.0
4  1.0 -2.0
5 -5.0  4.0
6 -1.0 -1.0
7  NaN  0.0
8  9.0  0.0

And similar for other functions/file formats.

P.S.: On v0.24+, you can preserve integer type even if your column has NaNs (yes, talk about having the cake and eating it too). You can specify dtype='Int32'

df = pd.read_csv('file.csv', na_values=['-'], dtype='Int32')
df

     A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0

df.dtypes

A    Int32
B    Int32
dtype: object

The dtype is not a conventional int type... but rather, a Nullable Integer Type. There are other options.

Handling Numeric Data: `pd.to_numeric` with `errors='coerce`

If you're dealing with numeric data, a faster solution is to use pd.to_numeric with the errors='coerce' argument, which coerces invalid values (values that cannot be cast to numeric) to NaN.

pd.to_numeric(df['A'], errors='coerce')

0    NaN
1    3.0
2    2.0
3    5.0
4    1.0
5   -5.0
6   -1.0
7    NaN
8    9.0
Name: A, dtype: float64

To retain (nullable) integer dtype, use

pd.to_numeric(df['A'], errors='coerce').astype('Int32')

0    NaN
1      3
2      2
3      5
4      1
5     -5
6     -1
7    NaN
8      9
Name: A, dtype: Int32

To coerce multiple columns, use apply:

df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')

     A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0

...and assign the result back after.

More information can be found in this answer.

Asclepius · Answer 5 · 2022-02-16T18:26:32.807

8

With Pandas version ≥1.0.0, I would use DataFrame.replace or Series.replace:

df.replace(old_val, pd.NA, inplace=True)

This is better for two reasons:

It uses pd.NA instead of None or np.nan.
It optionally works in-place which could be more memory efficient depending upon the internal implementation.

edited Feb 16 '22 at 18:26

answered Oct 14 '20 at 21:57

Asclepius

57,944
17
167
143

score 3 · Answer 6 · answered Jan 13 '19 at 02:21

3

df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
df = df.where(df!='-', None)

answered Jan 13 '19 at 02:21

Shravan kp

109
1
3

Freek Wiekmeijer · Answer 7 · 2019-03-15T08:26:17.747

0

Setting null values can be done with np.nan:

import numpy as np
df.replace('-', np.nan)

Advantage is that df.last_valid_index() recognizes these as invalid.

edited Mar 15 '19 at 08:26

answered Mar 15 '19 at 08:15

Freek Wiekmeijer

4,556
30
37

score 0 · Answer 8 · answered Dec 28 '19 at 12:30

0

Using replace and assigning a new df:

import pandas as pd
df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
dfnew = df.replace('-', 0)
print(dfnew)


(venv) D:\assets>py teste2.py
   0
0  0
1  3
2  2
3  5
4  1
5 -5

answered Dec 28 '19 at 12:30

daniel rocha

43
1
6

score 0 · Answer 9 · answered May 21 '20 at 16:05

0

df.replace('-', np.nan).astype("object")

This will ensure that you can use isnull() later on your dataframe

answered May 21 '20 at 16:05

Keng Chan

21
2

score 0 · Answer 10 · answered May 04 '22 at 20:44

0

Alternatively you can also use mask:

df.mask(df=='-', None)

answered May 04 '22 at 20:44

rachwa

1,805
1
14
17

Replace invalid values with None in Pandas DataFrame

10 Answers10

pandas >= 0.18: Use `na_values=['-']` argument with `read_csv`

Handling Numeric Data: `pd.to_numeric` with `errors='coerce`

Linked

Related

Replace invalid values with None in Pandas DataFrame

10 Answers10

pandas >= 0.18: Use na_values=['-'] argument with read_csv

Handling Numeric Data: pd.to_numeric with errors='coerce

Linked

Related

pandas >= 0.18: Use `na_values=['-']` argument with `read_csv`

Handling Numeric Data: `pd.to_numeric` with `errors='coerce`