51

Consider the following situation:

In [2]: a = pd.Series([1,2,3,4,'.'])

In [3]: a
Out[3]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

In [8]: a.astype('float64', raise_on_error = False)
Out[8]: 
0    1
1    2
2    3
3    4
4    .
dtype: object

I would have expected an option that allows conversion while turning erroneous values (such as that .) to NaNs. Is there a way to achieve this?

cs95
  • 379,657
  • 97
  • 704
  • 746
Korem
  • 11,383
  • 7
  • 55
  • 72

3 Answers3

86

Use pd.to_numeric with errors='coerce'

# Setup
s = pd.Series(['1', '2', '3', '4', '.'])
s

0    1
1    2
2    3
3    4
4    .
dtype: object

pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64

If you need the NaNs filled in, use Series.fillna.

pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')

0    1
1    2
2    3
3    4
4    0
dtype: float64

Note, downcast='infer' will attempt to downcast floats to integers where possible. Remove the argument if you don't want that.

From v0.24+, pandas introduces a Nullable Integer type, which allows integers to coexist with NaNs. If you have integers in your column, you can use

pd.__version__
# '0.24.1'

pd.to_numeric(s, errors='coerce').astype('Int32')

0      1
1      2
2      3
3      4
4    NaN
dtype: Int32

There are other options to choose from as well, read the docs for more.


Extension for DataFrames

If you need to extend this to DataFrames, you will need to apply it to each row. You can do this using DataFrame.apply.

# Setup.
np.random.seed(0)
df = pd.DataFrame({
    'A' : np.random.choice(10, 5), 
    'C' : np.random.choice(10, 5), 
    'B' : ['1', '###', '...', 50, '234'], 
    'D' : ['23', '1', '...', '268', '$$']}
)[list('ABCD')]
df

   A    B  C    D
0  5    1  9   23
1  0  ###  3    1
2  3  ...  5  ...
3  3   50  2  268
4  7  234  4   $$

df.dtypes

A     int64
B    object
C     int64
D    object
dtype: object

df2 = df.apply(pd.to_numeric, errors='coerce')
df2

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

df2.dtypes

A      int64
B    float64
C      int64
D    float64
dtype: object

You can also do this with DataFrame.transform; although my tests indicate this is marginally slower:

df.transform(pd.to_numeric, errors='coerce')

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

If you have many columns (numeric; non-numeric), you can make this a little more performant by applying pd.to_numeric on the non-numeric columns only.

df.dtypes.eq(object)

A    False
B     True
C    False
D     True
dtype: bool

cols = df.columns[df.dtypes.eq(object)]
# Actually, `cols` can be any list of columns you need to convert.
cols
# Index(['B', 'D'], dtype='object')

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
# Alternatively,
# for c in cols:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

df

   A      B  C      D
0  5    1.0  9   23.0
1  0    NaN  3    1.0
2  3    NaN  5    NaN
3  3   50.0  2  268.0
4  7  234.0  4    NaN

Applying pd.to_numeric along the columns (i.e., axis=0, the default) should be slightly faster for long DataFrames.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • 3
    This is awesome :-) , We should keep the thing updated , since left deprecated method on this site is not good for the future visitor :-) , You did it – BENY Dec 23 '17 at 16:28
  • Aha, maybe you can add `s.str.isalnum()` :-) combine with `mask` – BENY Dec 23 '17 at 16:31
  • @Wen You mean `s.str.isdigit()`? It will only work for integers, not floats. Good idea though. – cs95 Dec 23 '17 at 16:32
  • 1
    @Wen as per I know `to_numeric` is gold here. – Bharath M Shetty Dec 23 '17 at 16:33
  • @cᴏʟᴅsᴘᴇᴇᴅ as of reference we can take this question https://stackoverflow.com/q/47001413/4800652 to consideration. – Bharath M Shetty Dec 23 '17 at 16:43
  • Thanks! However, using axis=0 yields an error in my Python 3.6.2: File "C:\Python\lib\site-packages\pandas\core\series.py", line 2294, in apply mapped = lib.map_infer(values, f, convert=convert_dtype) File "pandas\src\inference.pyx", line 1207, in pandas.lib.map_infer (pandas\lib.c:66124) File "C:\Python\lib\site-packages\pandas\core\series.py", line 2282, in f = lambda x: func(x, *args, **kwds) TypeError: to_numeric() got an unexpected keyword argument 'axis' – Peter May 24 '18 at 10:28
20
In [30]: pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
Out[30]: 
0     1
1     2
2     3
3     4
4   NaN
dtype: float64
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • I think I took out the function of ``raise_on_error`` a while back. Doesn't do anything ATM. – Jeff Sep 20 '14 at 20:09
  • 1
    I opened an issue for enhancement here: https://github.com/pydata/pandas/issues/8332, feel free to comment on proposed API – Jeff Sep 20 '14 at 20:12
  • On 0.14.1 it prevents it from throwing an exception. Without specifying it the `astype` statement would've raise an error. – Korem Sep 20 '14 at 20:12
  • sorry you are right. That's why never JUST read the code, test it :) – Jeff Sep 20 '14 at 20:13
  • 8
    `.convert_objects()` method has been deprecated since 0.17, [`pd.to_numeric`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html) is the new way to go. – YaOzI Nov 03 '16 at 08:05
-2

Do this:

pd.to_numeric(s, errors='coerce')

Hemanta Nandi
  • 141
  • 1
  • 4