
I have a data source where all the values are given as strings. When I create a Pandas dataframe from this data, all the columns are naturally of type object. I then want to let Pandas automatically convert any columns that look like numbers into numeric types (e.g. int64, float64).

Pandas supposedly provides a function to do this automatic type inference: pandas.DataFrame.infer_objects(). It's also mentioned in this StackOverflow post. The documentation says:

Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.

However, the function is not working for me. In the reproducible example below, I have two string columns (value1 and value2) that unambiguously look like int and float values, respectively, but infer_objects() does not convert them from string to the appropriate numeric types.

import pandas as pd

# Create example dataframe.
data = [ ['Alice', '100', '1.1'], ['Bob', '200', '2.1'], ['Carl', '300', '3.1']]
df = pd.DataFrame(data, columns=['name', 'value1', 'value2'])

print(df)

#     name value1 value2
# 0  Alice    100    1.1
# 1    Bob    200    2.1
# 2   Carl    300    3.1

print(df.info())

# Data columns (total 3 columns):
#  #   Column  Non-Null Count  Dtype 
# ---  ------  --------------  ----- 
#  0   name    3 non-null      object
#  1   value1  3 non-null      object
#  2   value2  3 non-null      object
# dtypes: object(3)

df = df.infer_objects() # Should convert value1 and value2 columns to numerics.

print(df.info())

# Data columns (total 3 columns):
#  #   Column  Non-Null Count  Dtype 
# ---  ------  --------------  ----- 
#  0   name    3 non-null      object
#  1   value1  3 non-null      object
#  2   value2  3 non-null      object
# dtypes: object(3)

Any help would be appreciated.

stackoverflowuser2010

5 Answers


Further to @wwnde's answer, here is the same solution in a slightly different form:

df["value1"] = pd.to_numeric(df["value1"])
df["value2"] = pd.to_numeric(df["value2"])

EDIT: This is an interesting question and I'm surprised that pandas doesn't convert obvious string floats and integers as you show.

However, this small code can get you through the dataframe and convert your columns.

data = [["Alice", "100", "1.1"], ["Bob", "200", "2.1"], ["Carl", "300", "3.1"]]
df = pd.DataFrame(data, columns=["name", "value1", "value2"])

print(df.info(), "\n")

RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      object
 1   value1  3 non-null      object
 2   value2  3 non-null      object
dtypes: object(3)

for c in df.columns:
    try:
        df[c] = pd.to_numeric(df[c])
    except (ValueError, TypeError):
        # Leave columns that can't be parsed as numbers unchanged.
        pass

print(df.info())

RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    3 non-null      object 
 1   value1  3 non-null      int64  
 2   value2  3 non-null      float64
dtypes: float64(1), int64(1), object(1)
run-out
    My question's point is that I would like Pandas to infer the types for me. I wouldn't know that `value1` and `value2` were numbers. – stackoverflowuser2010 May 09 '20 at 22:42
  • @stackoverflowuser2010 ```infer_objects``` wouldn't work the way you want; if you check the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.infer_objects.html), they aren't taking the first row, which has a string... – Hozayfa El Rifai May 09 '20 at 22:47
  • Interesting question. Also I came across this interesting [article](https://rushter.com/blog/pandas-data-type-inference/) that might help understanding. – run-out May 09 '20 at 23:08
  • The documentation for `infer_objects()` says: `The inference rules are the same as during normal Series/DataFrame construction.`. Whenever I run `pd.read_csv()` to build a new dataframe, that function correctly infers the data types. – stackoverflowuser2010 May 09 '20 at 23:15
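As the comments point out, infer_objects() only firms up object columns whose elements are already non-string scalars; it never parses strings. A minimal sketch of the distinction:

```python
import pandas as pd

# Ints boxed as Python objects: infer_objects() recovers int64.
boxed = pd.DataFrame({'a': [1, 2, 3]}, dtype=object)
print(boxed.infer_objects().dtypes['a'])   # int64

# The same digits as strings: nothing is parsed, the column stays non-numeric.
strings = pd.DataFrame({'a': ['1', '2', '3']}, dtype=object)
print(pd.api.types.is_numeric_dtype(strings.infer_objects()['a']))  # False
```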

df_new = df.convert_dtypes() may help. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
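One caveat worth checking against your pandas version: convert_dtypes() picks the best dtype supporting pd.NA, and for columns of digit strings that is typically the nullable string dtype, not a numeric one, so a pd.to_numeric pass may still be needed. A quick check:

```python
import pandas as pd

df = pd.DataFrame({'value1': ['100', '200', '300']})
out = df.convert_dtypes()
# The strings are not parsed as numbers.
print(pd.api.types.is_numeric_dtype(out['value1']))  # False
```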

cpearce95

Inferring types has lots of edge cases. There is an open bug covering this issue.

In the meantime, a workaround could be to use the type inference in the CSV reader, which is quite good, and what most people arriving at this page were expecting. So, until someone refactors that code out from the reader into a more generic solution, this hack might get many people moving:

# Given a Pandas DataFrame df, create a CSV in memory and read it back in:

import io

stream = io.StringIO()
df.to_csv(stream, index=False)
stream.seek(0)
df = pd.read_csv(stream)
stream.close()

With all the usual caveats around speed and opening yourself up to other type conversion bugs.
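The snippet above can be wrapped into a small helper (the name infer_via_csv is my own) and applied to the question's data:

```python
import io

import pandas as pd

def infer_via_csv(df):
    # Round-trip through an in-memory CSV to reuse read_csv's type inference.
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    buf.seek(0)
    return pd.read_csv(buf)

data = [['Alice', '100', '1.1'], ['Bob', '200', '2.1'], ['Carl', '300', '3.1']]
df = pd.DataFrame(data, columns=['name', 'value1', 'value2'])
print(infer_via_csv(df).dtypes)  # value1 -> int64, value2 -> float64
```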

Leo

This function works okay; add more string replacements for other currencies.

def convert_datatype(df):
    for column in df.columns:
        try:
            # Strip currency symbols and thousands separators, then parse as float.
            # regex=False matters: '$' is a regex metacharacter, and older pandas
            # versions defaulted to regex=True.
            df[column] = (df[column].str.replace('$', '', regex=False)
                                    .str.replace('£', '', regex=False)
                                    .str.replace(',', '', regex=False)
                                    .astype(float))
        except (ValueError, TypeError, AttributeError):
            # AttributeError covers columns that are already non-string.
            try:
                df[column] = df[column].astype(int)
            except (ValueError, TypeError):
                df[column] = df[column].astype(str)
    return df
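For example, the same idea as a small standalone sketch (clean_currency is a hypothetical helper; note the explicit regex=False, since '$' is a regex metacharacter):

```python
import pandas as pd

def clean_currency(df, column):
    # Hypothetical helper: strip symbols and separators, then parse as numbers.
    cleaned = (df[column].str.replace('$', '', regex=False)
                         .str.replace('£', '', regex=False)
                         .str.replace(',', '', regex=False))
    return pd.to_numeric(cleaned)

df = pd.DataFrame({'price': ['$1,000', '£2,500.50', '300']})
df['price'] = clean_currency(df, 'price')
print(df['price'].dtype)  # float64
```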

A little late to the party, but sharing my findings.
For me too, infer_objects() did not give the results I was aiming for.
What gave me a hint was @stackoverflowuser2010's comment:

The documentation for infer_objects() says:
"The inference rules are the same as during normal Series/DataFrame construction."
Whenever I run pd.read_csv() to build a new dataframe, that function correctly infers the data types.

So I figured, I'll build the data from csv:

import pandas as pd
from io import StringIO

data = [ ['Alice', '100', '1.1'], ['Bob', '200', '2.1'], ['Carl', '300', '3.1']]
columns=['name', 'value1', 'value2']
as_csv = '\n'.join(' '.join(l) for l in data)
df = pd.read_csv(StringIO(as_csv), sep=' ', header=None, names=columns)

print(df)  
#    name  value1  value2
# 0  Alice     100     1.1
# 1    Bob     200     2.1
# 2   Carl     300     3.1


print(df.info())  
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 3 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   name    3 non-null      object
#  1   value1  3 non-null      int64
#  2   value2  3 non-null      float64
# dtypes: float64(1), int64(1), object(1)
# memory usage: 200.0+ bytes

I admit, it is not very elegant, but it is the best solution I could find for setting arbitrary columns to their correct data type.

Avi Turner