Scientific notation being read as string in pandas

Question

I'm trying to read a .csv file with a column containing numbers in scientific notation. No matter what I do, pandas ends up reading them as strings:

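import numpy as np
import pandas as pd

def readData(path, cols):
    types  = [str, str, str, str, np.float32]
    # Map each selected column position to its dtype (use the cols argument, not the global c)
    t_dict = {key: value for (key, value) in zip(cols, types)}

    # With chunksize set, this returns an iterator of DataFrames, not a single DataFrame
    df = pd.read_csv(path, header=0, sep=';', encoding='latin1', usecols=cols, dtype=t_dict, chunksize=5000)

    return df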

c = [3, 6, 7, 9, 16]
df2017_chunks = readData('Data/2017.csv', c)

def preProcess(df, f):    
    df.columns = f
    df['id_client'] = df['id_client'].apply(lambda x: str(int(float(x))))

    return df

f = ['issue_date', 'channel', 'product', 'issue', 'id_client']

df = pd.DataFrame(columns=f)
for chunk in df2017_chunks:
    aux = preProcess(chunk, f)
    df = pd.concat([df, aux])

How can I properly read this data?

Can you post a small sample out of the CSV which pandas is trying to read? — cardamom, May 24 '17 at 13:24
Very similar question: [Pandas read scientific notation and change](https://stackoverflow.com/questions/34013790/pandas-read-scientific-notation-and-change) — Herpes Free Engineer, Jul 12 '18 at 18:31

score 1 · Answer 1 · answered May 24 '17 at 13:42

Your preProcess function converts id_client back to a string after the column has already been read with its dtype applied. Is this intended behavior?

Could you try:

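# No chunksize here, so read_csv returns a DataFrame rather than an iterator
df = pd.read_csv(path, header=0, sep=';', encoding='latin1', usecols=cols)
df.columns = f  # rename the columns as in the question before selecting id_client
df["id_client"] = pd.to_numeric(df["id_client"])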
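
If the file still has to be read in chunks, note that read_csv with chunksize returns an iterator of DataFrames, so the conversion has to be applied to each chunk. A minimal sketch of that idea follows; the inline CSV text, its column names, and the chunk size of 2 are made up for illustration and stand in for the real file:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = (
    "issue_date;id_client\n"
    "2017-01-02;1.5E+3\n"
    "2017-01-03;2.0E+6\n"
    "2017-01-04;abc\n"
)

# dtype=str reproduces the situation from the question: every value arrives as a string.
reader = pd.read_csv(io.StringIO(csv_text), sep=';', dtype=str, chunksize=2)

chunks = []
for chunk in reader:
    # Convert per chunk; values that cannot be parsed become NaN.
    chunk['id_client'] = pd.to_numeric(chunk['id_client'], errors='coerce')
    chunks.append(chunk)

# Concatenate once at the end instead of growing a DataFrame inside the loop.
df = pd.concat(chunks, ignore_index=True)
print(df.dtypes)
```

Collecting the chunks in a list and calling pd.concat once avoids copying the accumulated data on every iteration.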

Patrick Hingston · Answer 2 · 2017-05-24T14:51:34.547

Sample dataframe:

df = pd.DataFrame({'issue_date': [1920,1921,1922,1923,1924,1925,1926],
    'name': ['jon doe1','jon doe2','jon doe3','jon doe4','jon doe5','jon doe6','jon doe7'],
    'id_client': ['18.61', '17.60', '18.27', '16.18', '16.81', '16.37', '67.07']})

You can check the dataframe's types with the following command:

print(df.dtypes)

output:

id_client     object
issue_date     int64
name          object
dtype: object

Convert df['id_client'] from object to float64 using the following command:

df['id_client'] = pd.to_numeric(df['id_client'], errors='coerce')

errors='coerce' will result in NaN when an item cannot be converted. Running
print(df.dtypes) again results in the following output:

id_client     float64
issue_date      int64
name           object
dtype: object
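
The sample above uses plain decimals, but the same call also parses strings in scientific notation. As a rough sketch (the values below are invented, and the step to a nullable integer dtype is an assumption about what an ID column would need, not part of the original answer):

```python
import pandas as pd

# Made-up ID values stored as strings, two of them in scientific notation.
ids = pd.Series(['1.5E+3', '2.0E+6', 'not a number'], dtype=object)

# Strings in scientific notation are parsed to float64; bad values become NaN.
as_float = pd.to_numeric(ids, errors='coerce')

# Optional: the nullable Int64 dtype keeps whole-number IDs as integers
# and represents the missing entry as <NA> instead of NaN.
as_int = as_float.astype('Int64')

print(as_float.dtype)
print(as_int)
```

Keep in mind that float64 represents integers exactly only up to 2**53, so very long IDs may be safer to read and keep as strings.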
