I have this CSV file:

id,adset_id,source
1,,google
2,23843814084680281,facebook
3,,google
4,23843814088700279,facebook
5,23843704830370464,facebook

My problem is that when I read it with pandas, I cannot pass a schema, so pandas infers the adset_id column as float64 (because of the NaN values).

So if I write this:

import pandas as pd

df = pd.read_csv('/Users/test/Desktop/float.csv')
print(df)

I get scientific notation in the adset_id column:

   id      adset_id    source
0   1           NaN    google
1   2  2.384381e+16  facebook
2   3           NaN    google
3   4  2.384381e+16  facebook
4   5  2.384370e+16  facebook

I could not find a way to fix this, so I tried a hack: converting the number to a string. To do that, I first have to convert it to int64 and then to a string.

import pandas as pd
import numpy as np

df = pd.read_csv('/Users/test/Desktop/float.csv')

df = df.fillna({'adset_id':-1})
df['adset_id'] = df['adset_id'].astype('int64')
df['adset_id'] = df['adset_id'].astype('str')
df['adset_id'] = df['adset_id'].replace('-1', np.nan)

print(df)

The result is:

   id           adset_id    source
0   1                NaN    google
1   2  23843814084680280  facebook
2   3                NaN    google
3   4  23843814088700280  facebook
4   5  23843704830370464  facebook

As you can see, two of my adset_id values get rounded:
23843814084680281 -> 23843814084680280
23843814088700279 -> 23843814088700280
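
The rounding above is a float64 limit rather than something specific to pandas: a float64 has a 53-bit significand, so not every integer above 2**53 can be represented exactly. A minimal sketch in plain Python, assuming the usual case where Python's `float` is an IEEE 754 double (the names `ids`, `limit`, and `round_tripped` are only illustrative):

```python
# The three IDs from the CSV above.
ids = [23843814084680281, 23843814088700279, 23843704830370464]

# Every integer up to 2**53 survives a round trip through float unchanged;
# above that, only some integers are representable.
limit = 2 ** 53

# Convert each ID to float and back, as happens when the column is float64.
round_tripped = [int(float(value)) for value in ids]

for original, result in zip(ids, round_tripped):
    status = "unchanged" if original == result else "rounded"
    print(original, "->", result, status)
```

All three IDs are larger than 2**53, and the first two fall between representable values, which is why they come back ending in 280.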

I just want to read this CSV into a pandas DataFrame without getting adset_id in scientific notation. Any solution would be appreciated.

Am1rr3zA
  • Use `pd.read_csv('/Users/test/Desktop/float.csv', dtype={'adset_id': object})` – harvpan Sep 24 '19 at 14:54
  • Possible duplicate of [Import pandas dataframe column as string not int](https://stackoverflow.com/questions/13293810/import-pandas-dataframe-column-as-string-not-int) – harvpan Sep 24 '19 at 14:54

2 Answers

Look at the dtype argument of pd.read_csv. You can pass a dictionary of dtypes to ensure the column is read as a string.

df = pd.read_csv('PATH_TO_CSV.csv', dtype={'adset_id':str})

You can also look at the na_values, keep_default_na, and na_filter arguments to control how missing values are handled.
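
As a rough, self-contained sketch of the above, the CSV from the question can be fed through `io.StringIO` instead of a file path (the in-memory text and the variable names `csv_text`, `df`, and `df_blank` are assumptions made for illustration):

```python
import io

import pandas as pd

# Same content as the CSV in the question, kept in memory for the example.
csv_text = """id,adset_id,source
1,,google
2,23843814084680281,facebook
3,,google
4,23843814088700279,facebook
5,23843704830370464,facebook
"""

# Read adset_id as text so it never passes through float64.
df = pd.read_csv(io.StringIO(csv_text), dtype={"adset_id": str})
print(df)

# With keep_default_na=False, empty fields stay as empty strings
# instead of being turned into NaN.
df_blank = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"adset_id": str},
    keep_default_na=False,
)
print(df_blank)
```

Because the IDs are identifiers rather than quantities, keeping them as strings avoids both the scientific notation and the rounding.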

MattR

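The "conversion" to scientific notation happens in the way pandas displays the data. Try adding the following code right after you import pandas. Note that this only changes how floats are printed: the column is still float64, so digits already lost to rounding are not restored.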

import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
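
A small sketch of what that display option does and does not do, assuming a float64 Series built by hand from one of the IDs in the question (the names `s` and `text` are illustrative, and `pd.option_context` is used here only so the setting does not leak outside the example):

```python
import pandas as pd

# One ID from the question stored as float64, plus a missing value.
s = pd.Series([float(23843814084680281), float("nan")], name="adset_id")

# Apply the same format string temporarily instead of setting it globally.
with pd.option_context("display.float_format", "{:.2f}".format):
    text = s.to_string()

print(text)

# The stored value is still the rounded float, not the original integer.
print(int(s.iloc[0]) == 23843814084680281)
```

The scientific notation goes away, but the printed number is the rounded float64 value, so this is a fit only when the exact digits do not matter.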
ParalysisByAnalysis