I have some data to convert to a DataFrame. Take the data below, for example.

df_raw = [
("Madhan", 0, 9.34),
("Kumar", None, 7.6)
]

When I convert this to a pandas DataFrame, the int column is automatically converted to float.

pd.DataFrame(df_raw)

        0    1     2
0  Madhan  0.0  9.34
1   Kumar  NaN  7.60

How do I avoid this?

What I tried:
It's actually fine for me as long as the actual text of the elements in the DataFrame doesn't change. So I tried defining the DataFrame with the column type set to str or pd.StringDtype(), but neither works; both give the same result.

pd.DataFrame(df_raw, dtype = str)

        0    1     2
0  Madhan  0.0  9.34
1   Kumar  NaN   7.6

pd.DataFrame(df_raw, dtype = pd.StringDtype())
        0     1     2
0  Madhan   0.0  9.34
1   Kumar  <NA>   7.6

Also, please don't suggest converting the integer columns to a nullable int type such as pd.Int64Dtype() or Int64, because I won't know which columns are integer columns; this is part of an automation.

I also can't convert each element to a string individually, because the DataFrame might be huge and doing so could take a long time.

Edit: convert_dtypes also doesn't work if the number is large, as shown.

df_raw = [
    ("Madhan", 5, 9.34, None),
    ("Kumar", None, 7.6, 34534543454)
]

pd.DataFrame(df_raw).convert_dtypes()

        0     1     2              3
0  Madhan     5  9.34           <NA>
1   Kumar  <NA>   7.6  34534543454.0
madhan01

2 Answers

1

Use convert_dtypes to automatically infer the nullable dtypes:

df = pd.DataFrame(df_raw).convert_dtypes()

print(df)

        0     1     2            3
0  Madhan     5  9.34         <NA>
1   Kumar  <NA>   7.6  34534543454

print(df.dtypes)

0     string
1      Int64
2    Float64
3      Int64
dtype: object
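
Since the integer columns are not known in advance, one possible way to find out afterwards which ones became nullable integers is to inspect the dtypes. This is only a sketch: the name int_cols is made up here, and the result assumes a pandas version that behaves like the output above.

```python
import pandas as pd

df_raw = [
    ("Madhan", 5, 9.34, None),
    ("Kumar", None, 7.6, 34534543454),
]

# Let pandas infer nullable dtypes for every column
df = pd.DataFrame(df_raw).convert_dtypes()

# Collect the labels of the columns whose dtype is an integer type
# (is_integer_dtype also accepts nullable dtypes such as Int64)
int_cols = [c for c in df.columns if pd.api.types.is_integer_dtype(df[c].dtype)]
print(int_cols)
```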
mozway
  • this doesn't seem to work if the integer is large. I've edited the question to add this output as well. – madhan01 Jun 23 '23 at 13:47
  • @madhan01 yes this is normal, [python has no limit for integers](https://stackoverflow.com/questions/7604966/maximum-and-minimum-values-for-ints), pandas/numpy do. Although `34534543454` gets converted to Int64 without issue in my hands. – mozway Jun 23 '23 at 13:51
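
To illustrate the point in the comment above, here is a rough sketch that compares the value from the question with the largest integer a 64-bit NumPy dtype can hold. The variable names are illustrative only.

```python
import numpy as np

# Largest value representable by a signed 64-bit integer
int64_max = np.iinfo(np.int64).max
print(int64_max)

# The value from the question is well inside that range
value = 34534543454
fits = value <= int64_max
print(fits)

# A Python int can grow past the 64-bit limit without any error
too_big = 2**70
print(too_big > int64_max)
```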
0

You cannot simply convert a column that contains NaNs to int, because NaN is a float value. However, pandas has nullable integer types to handle this (Int8 ... Int64).

For example:

df = pd.DataFrame({'a': [1, 2, 3, np.nan]})
df['a'] = df['a'].astype('Int64')

# df.a Output
0       1
1       2
2       3
3    <NA>
Name: a, dtype: Int64
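
A small sketch of why the plain cast fails, assuming a reasonably recent pandas: NaN is a float, so astype(int) raises an error, while the nullable Int64 dtype stores the missing value as <NA>. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan])
print(s.dtype)  # the NaN forces the whole column to a float dtype

# Casting to a plain NumPy int fails because NaN has no integer representation
try:
    s.astype(int)
    plain_cast_ok = True
except ValueError as exc:
    plain_cast_ok = False
    print("plain int cast failed:", exc)

# The nullable integer dtype keeps the integers and marks the gap as <NA>
nullable = s.astype("Int64")
print(nullable.dtype)
```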

There is a specific function to handle this (convert_dtypes):

df_raw = pd.DataFrame([
    ("Madhan", 5, 9.34, None),
    ("Kumar", None, 7.6, 34534543454)
])

df = df_raw.convert_dtypes()

# df output
        0     1     2            3
0  Madhan     5  9.34         <NA>
1   Kumar  <NA>   7.6  34534543454

EDIT: I actually checked with your code and it works.

100tifiko