I have a large csv file produced from a website, with over 60 columns. If a column that should be interpreted as integers contains even one empty value, every value in that column is interpreted as float64.
Is there a way to default this to int64 and avoid the tedious task of defining the datatype for each column?
Here is an example source csv file with no blanks:
FirstNumber, SecondNumber, Name, Price, code
3,2,"Robin",$24.50,1
4,5,"Robin",$24.50,6
7,8,"Robin",$24.50,9
10,11,"Robin",$24.50,12
Here are the results:
>>> df_test = pd.read_csv('test_csv.csv')
>>> df_test
FirstNumber SecondNumber Name Price code
0 3 2 Robin $24.50 1
1 4 5 Robin $24.50 6
2 7 8 Robin $24.50 9
3 10 11 Robin $24.50 12
>>> df_test.dtypes
FirstNumber int64
SecondNumber int64
Name object
Price object
code int64
dtype: object
If I modify the csv file so that some rows have empty values in numeric columns, every value in any column that contains a NaN defaults to float64. Two examples are provided below:
**Example 1 of csv source data**
FirstNumber, SecondNumber, Name, Price, code
3,2,"Robin",$24.50,1
,5,"Robin",$24.50,6
7,8,"Robin",$24.50,9
10,11,"Robin",$24.50,
Resulting DataFrame. The FirstNumber and code columns are cast as float64:
>>> df_test = pd.read_csv('test_csv.csv')
>>> df_test
FirstNumber SecondNumber Name Price code
0 3.000 2 Robin $24.50 1.000
1 nan 5 Robin $24.50 6.000
2 7.000 8 Robin $24.50 9.000
3 10.000 11 Robin $24.50 nan
>>> df_test.dtypes
FirstNumber float64
SecondNumber int64
Name object
Price object
code float64
dtype: object
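To make Example 1 easier to reproduce without a file on disk, here is a minimal sketch that feeds the same rows through `io.StringIO`. The inline string and the header written without spaces after the commas are assumptions made for illustration, not part of the original file.

```python
import io

import pandas as pd

# Same rows as Example 1, inlined so no csv file is needed.
# Assumption: the header is written without spaces after the commas.
csv_text = """FirstNumber,SecondNumber,Name,Price,code
3,2,"Robin",$24.50,1
,5,"Robin",$24.50,6
7,8,"Robin",$24.50,9
10,11,"Robin",$24.50,
"""

df_test = pd.read_csv(io.StringIO(csv_text))

# FirstNumber and code each contain one empty field and come back as float64;
# SecondNumber has no empty fields and stays int64.
print(df_test.dtypes)
```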
**Example 2 of csv source data**
In this example, SecondNumber and Price have empty values somewhere.
FirstNumber, SecondNumber, Name, Price, code
3,2,"Robin",$24.50,1
4,,"Robin",$24.50,6
7,8,"Robin",,9
10,11,"Robin",$24.50,12
Here's the output. SecondNumber is cast as float64 as predicted, while Price is still cast as object:
>>> df_test
FirstNumber SecondNumber Name Price code
0 3 2.000 Robin $24.50 1
1 4 nan Robin $24.50 6
2 7 8.000 Robin NaN 9
3 10 11.000 Robin $24.50 12
>>> df_test.dtypes
FirstNumber int64
SecondNumber float64
Name object
Price object
code int64
dtype: object
Why, when a single value in a column is empty in the csv file, does one datatype change (int64 becomes float64) while another (object) does not?
Is there an elegant way to tell pandas to preserve the int64 type in such situations?
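One direction that may be worth checking, offered as a sketch and not a definitive answer: NumPy's int64 has no way to represent a missing value, so pandas falls back to float64, where NaN exists, whereas an object column can already hold NaN next to strings. pandas also provides a nullable integer extension dtype, `Int64` (capital I), and `DataFrame.convert_dtypes()` converts float columns that hold only whole numbers to it. The inline data and the header without spaces after the commas are assumptions made for illustration.

```python
import io

import pandas as pd

# Same rows as Example 1, inlined for illustration (assumed header without spaces).
csv_text = """FirstNumber,SecondNumber,Name,Price,code
3,2,"Robin",$24.50,1
,5,"Robin",$24.50,6
7,8,"Robin",$24.50,9
10,11,"Robin",$24.50,
"""

df_raw = pd.read_csv(io.StringIO(csv_text))

# convert_dtypes() switches columns to pandas' nullable extension dtypes.
# Float columns that contain only whole numbers and missing values become Int64,
# with the missing entries stored as <NA> instead of NaN.
df_nullable = df_raw.convert_dtypes()

print(df_nullable.dtypes)
print(df_nullable)
```

This avoids listing a dtype for each of the 60+ columns. Note that the result is the nullable `Int64`, not NumPy's `int64`, so code that expects plain NumPy integers may need adjusting. On pandas 2.0 and later, passing `dtype_backend="numpy_nullable"` to `pd.read_csv` should give a similar result at read time.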