My question is very similar to this one, but I need to convert my entire dataframe instead of just a series. The to_numeric function only works on one series at a time and is not a good replacement for the deprecated convert_objects command. Is there a way to get similar results to the convert_objects(convert_numeric=True) command in the new pandas release?

Thank you Mike Müller for your example. df.apply(pd.to_numeric) works very well if the values can all be converted to integers. What if in my dataframe I had strings that could not be converted into integers? Example:

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
df.dtypes
Out[59]: 
Words    object
ints     object
dtype: object

Then I could run the deprecated function and get:

df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[60]: 
Words    object
ints      int64
dtype: object

Running the apply command gives me errors, even with try and except handling.

Bobe Kryant

4 Answers

All columns convertible

You can apply the function to all columns:

df.apply(pd.to_numeric)

Example:

>>> df = pd.DataFrame({'a': ['1', '2'], 
                       'b': ['45.8', '73.9'],
                       'c': [10.5, 3.7]})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a    2 non-null object
b    2 non-null object
c    2 non-null float64
dtypes: float64(1), object(2)
memory usage: 64.0+ bytes

>>> df.apply(pd.to_numeric).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a    2 non-null int64
b    2 non-null float64
c    2 non-null float64
dtypes: float64(2), int64(1)
memory usage: 64.0 bytes

Not all columns convertible

pd.to_numeric has the keyword argument errors:

  Signature: pd.to_numeric(arg, errors='raise')
  Docstring:
  Convert argument to a numeric type.

Parameters
----------
arg : list, tuple or array of objects, or Series
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    - If 'coerce', then invalid parsing will be set as NaN
    - If 'ignore', then invalid parsing will return the input

Setting it to ignore will return the column unchanged if it cannot be converted into a numeric type.

As pointed out by Anton Protopopov, the most elegant way is to pass errors='ignore' as a keyword argument to apply(), which forwards it to pd.to_numeric:

>>> df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
>>> df.apply(pd.to_numeric, errors='ignore').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words    2 non-null object
ints     2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
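
In more recent pandas releases (2.2 and later), errors='ignore' is deprecated in pd.to_numeric, and the suggested replacement is to catch the exception explicitly. A minimal sketch of that idea, assuming the same example data; the helper name to_numeric_where_possible is hypothetical and not part of pandas:

```python
import pandas as pd


def to_numeric_where_possible(column: pd.Series) -> pd.Series:
    """Return the column converted to a numeric dtype, or unchanged if parsing fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # At least one value could not be parsed as a number: keep the column as it is.
        return column


df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})

# apply() calls the helper once per column and reassembles the results.
converted = df.apply(to_numeric_where_possible)
print(converted.dtypes)
```

This keeps the all-or-nothing behaviour per column: a column is converted only if every value in it parses.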

My previously suggested way, using partial from the module functools, is more verbose:

>>> from functools import partial
>>> df = pd.DataFrame({'ints': ['3', '5'], 
                       'Words': ['Kobe', 'Bryant']})
>>> df.apply(partial(pd.to_numeric, errors='ignore')).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words    2 non-null object
ints     2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
Mike Müller
  • I think the most elegant way is to pass this argument to `apply` as a keyword argument: `df.apply(pd.to_numeric, errors='ignore')` should work fine. – Anton Protopopov Jan 18 '16 at 05:58
  • `to_numeric` does not handle commas. – ChaimG Jun 08 '20 at 15:25
  • To get only integer columns in the end, as the question stated, loop through all columns: `for i in df.columns: try: df[[i]] = df[[i]].astype(int) except: pass` – questionto42 Nov 30 '20 at 23:02
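
Regarding the comment about commas: one possible workaround is to strip the separators from text columns before calling pd.to_numeric. The sketch below assumes the comma is a thousands separator (not a decimal mark); the helper name parse_with_commas and the 'amount' column are made up for illustration:

```python
import pandas as pd


def parse_with_commas(column: pd.Series) -> pd.Series:
    """Strip commas from a text column and convert it to numeric if possible."""
    if pd.api.types.is_numeric_dtype(column):
        # Already numeric: nothing to clean.
        return column
    cleaned = column.astype(str).str.replace(',', '', regex=False)
    try:
        return pd.to_numeric(cleaned)
    except (ValueError, TypeError):
        # Not a numeric column even without commas: return the original values.
        return column


df = pd.DataFrame({'amount': ['1,200', '3,450'], 'Words': ['Kobe', 'Bryant']})
converted = df.apply(parse_with_commas)
print(converted.dtypes)
```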

The accepted answer with pd.to_numeric() converts a column to float as soon as any value in it requires a float. Read closely, the question is about converting every numeric column to integer. To get that result, the accepted answer would still need a final loop over all columns that casts the numbers to int.

Just for completeness, this is even possible without pd.to_numeric(); of course, this is not recommended:

df = pd.DataFrame({'a': ['1', '2'], 
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})

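for col in df.columns:
    try:
        # Parse as float first so that strings like '45.8' are accepted, then cast to int.
        df[col] = df[col].astype(float).astype(int)
    except (ValueError, TypeError):
        # Leave columns that cannot be parsed as numbers unchanged.
        pass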

print(df.dtypes)

Out:

a    int32
b    int32
c    int32
dtype: object
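
Note that casting a float column with astype(int) truncates toward zero rather than rounding, so '45.8' ends up as 45. A short sketch of the difference, using the values of column b from the example; rounding first is an optional extra step, not part of the answer above:

```python
import pandas as pd

b = pd.Series(['45.8', '73.9']).astype(float)

# astype(int) drops the fractional part.
truncated = b.astype(int)

# Rounding before the cast gives the nearest integer instead.
rounded = b.round().astype(int)

print(truncated.tolist())  # [45, 73]
print(rounded.tolist())    # [46, 74]
```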

EDITED: Note that the solution above is not recommended and is unnecessarily complicated. As pointed out in a comment (thank you), pd.to_numeric() accepts the keyword argument downcast='integer', which yields an integer dtype when the values allow it. The accepted answer still does not mention this, though.
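
A small sketch of what downcast='integer' does on the example data. As far as I understand the option, it picks the smallest integer dtype that can hold the values, and it leaves a column as float when the values are not whole numbers:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'],
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})

# Extra keyword arguments given to apply() are forwarded to pd.to_numeric.
downcast = df.apply(pd.to_numeric, downcast='integer')
print(downcast.dtypes)
```

Here only column a becomes an integer column; b and c stay float because their values have fractional parts.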

Update: according to a comment by user Gary, "as of pandas 2.0.1, if input series contains empty string or None then the resulting dtype will still be float even when using downcast='integer'". That would mean that the approach above with .astype(float).astype(int) is relevant again if you want to be sure to get only integers.
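
If the data contains empty strings or None, another option (not mentioned in the thread, so treat it as a suggestion) is pandas' nullable integer dtype 'Int64', which can hold missing values alongside integers. A minimal sketch, assuming the non-missing values are whole numbers:

```python
import pandas as pd

raw = pd.Series(['1', '2', '', None])

# errors='coerce' turns anything unparseable into NaN, so the result is float.
as_float = pd.to_numeric(raw, errors='coerce')

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>.
as_nullable_int = as_float.astype('Int64')
print(as_nullable_int.dtype)
```

The cast to 'Int64' raises an error if any value has a fractional part, so it does not silently truncate.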

questionto42
  • If all of the 'numbers' are formatted as integers (i.e. `'5'`, not `'5.0'`) then the keyword argument `downcast='integer'` can be used in the `to_numeric` function to force the integer type: in this example `df.apply(pd.to_numeric, downcast='integer')` will return column `a` as integer. – JJL Dec 29 '20 at 22:22
  • Note that as of pandas 2.0.1, if input series contains empty string or `None` then the resulting dtype will still be float even when using `downcast='integer'`. – Gary May 18 '23 at 18:44

You can use astype() to convert a column to the desired data type.

For example: my_str_df = pd.DataFrame({'column_name': ['20', '30', '40']})

then: my_int_df = my_str_df['column_name'].astype(int) # this Series will have an integer dtype
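
Since the question is about a whole DataFrame, it may help to know that DataFrame.astype also accepts a mapping from column names to dtypes, so several columns can be converted in one call. A hedged sketch; the column names and target dtypes here are chosen for illustration, and unlike pd.to_numeric you have to name the columns yourself:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'],
                   'b': ['45.8', '73.9'],
                   'Words': ['Kobe', 'Bryant']})

# Only the listed columns are converted; 'Words' is left untouched.
typed = df.astype({'a': int, 'b': float})
print(typed.dtypes)
```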

P.R.
  • Downvote. The question was about a dataframe, not a series, and you do not explain how you would change a whole dataframe that also has float columns of type string like '45.8'. – questionto42 Nov 30 '20 at 22:48

Call apply() with pd.to_numeric and errors='ignore', and assign the result back to the DataFrame:

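df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
print("Orig: \n", df.dtypes)

# apply() returns a new DataFrame; without assignment the result is discarded.
df.apply(pd.to_numeric, errors='ignore')
print("\nto_numeric: \n", df.dtypes)

# Assigning the result back keeps the converted columns.
df = df.apply(pd.to_numeric, errors='ignore')
print("\nto_numeric with assign: \n", df.dtypes)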

Output:

Orig: 
 ints     object
Words    object
dtype: object

to_numeric: 
 ints     object
Words    object
dtype: object

to_numeric with assign: 
 ints      int64
Words    object
dtype: object
Alon Lavian
  • It goes without saying that you need to reassign the df if you want to save the changes. This should have been just a comment under the accepted solution. – questionto42 Nov 30 '20 at 23:16