
I have a large CSV file exported from a website. It has over 60 columns. If a column that should be interpreted as integers contains even one empty value, every value in that column is read as float64.

Is there a way to default these columns to int64 and avoid the tedious task of defining the datatype for each column?

Here is an example source CSV file with no blanks:

FirstNumber, SecondNumber, Name, Price, code
3,2,"Robin",$24.50,1
4,5,"Robin",$24.50,6
7,8,"Robin",$24.50,9
10,11,"Robin",$24.50,12

Here are the results:

>>> df_test = pd.read_csv('test_csv.csv')
>>> df_test
   FirstNumber   SecondNumber   Name   Price   code
0            3              2  Robin  $24.50      1
1            4              5  Robin  $24.50      6
2            7              8  Robin  $24.50      9
3           10             11  Robin  $24.50     12
>>> df_test.dtypes
FirstNumber       int64
 SecondNumber     int64
 Name            object
 Price           object
 code             int64
dtype: object

If I modify the CSV file so that some rows have empty numeric values, any column containing a NaN defaults entirely to float64. Two examples are provided below.

**Example 1 of CSV source data**

FirstNumber, SecondNumber, Name, Price, code
3,2,"Robin",$24.50,1
,5,"Robin",$24.50,6
7,8,"Robin",$24.50,9
10,11,"Robin",$24.50,

DataFrame population results; the FirstNumber and code columns are cast as float64:

>>> df_test = pd.read_csv('test_csv.csv')
>>> df_test
   FirstNumber   SecondNumber   Name   Price   code
0        3.000              2  Robin  $24.50  1.000
1          nan              5  Robin  $24.50  6.000
2        7.000              8  Robin  $24.50  9.000
3       10.000             11  Robin  $24.50    nan
>>> df_test.dtypes
FirstNumber      float64
 SecondNumber      int64
 Name             object
 Price            object
 code            float64
dtype: object


**Example 2 of CSV source data**

In this example, SecondNumber and Price have empty values somewhere.

FirstNumber, SecondNumber, Name, Price, code
3,2,"Robin",$24.50,1
4,,"Robin",$24.50,6
7,8,"Robin",,9
10,11,"Robin",$24.50,12

Here's the output. SecondNumber is cast as float64 as predicted, while Price is still cast as object:

>>> df_test
   FirstNumber   SecondNumber   Name   Price   code
0            3          2.000  Robin  $24.50      1
1            4            nan  Robin  $24.50      6
2            7          8.000  Robin     NaN      9
3           10         11.000  Robin  $24.50     12
>>> df_test.dtypes
FirstNumber        int64
 SecondNumber    float64
 Name             object
 Price            object
 code              int64
dtype: object

Why, when a single value in a column is empty in the CSV file, does one datatype change (int64 becomes float64) while another (object) does not?

Is there an elegant way to tell pandas to preserve the int64 type in such situations?
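
A possible approach, assuming a newer pandas than was current when this was asked (the nullable `Int64` extension dtype arrived in 0.24, and `convert_dtypes()` in 1.0), is to opt into nullable integers, which can hold missing values without falling back to float64:

import pandas as pd

# 'Int64' (capital I) is pandas' nullable extension dtype; unlike numpy's
# int64 it can hold missing values, so blanks no longer force float64.
# skipinitialspace strips the spaces after the commas in the header row.
df_test = pd.read_csv('test_csv.csv', skipinitialspace=True,
                      dtype={'FirstNumber': 'Int64', 'code': 'Int64'})

# Or read first and let pandas infer nullable dtypes for every column,
# which avoids spelling out all 60+ columns by hand (pandas >= 1.0):
df_test = pd.read_csv('test_csv.csv', skipinitialspace=True).convert_dtypes()
print(df_test.dtypes)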

rabbitholefinder
  • The issue is that each column of a pandas DataFrame must have the same type. There is no numpy integer representation of `NaN`, so your column cannot exist as an `int64` column. You can either keep the entire column as `object` or, better, convert the `NaN` values to a dummy integer like `-777` (sketched below), though this is annoying to keep track of and ruins a lot of the built-in `null` handling – ALollz Aug 14 '18 at 16:19
  • See https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value and http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na – ALollz Aug 14 '18 at 16:21
  • I noticed something interesting. Null values for columns cast as object show a value of NaN, but null values for columns cast as float64 show up as nan (lowercase). I found that searching using isnull() did not find these?? – rabbitholefinder Aug 14 '18 at 18:04
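
A minimal sketch of the sentinel workaround ALollz describes above (`-777` is an arbitrary flag value, not anything pandas-specific):

import pandas as pd

df_test = pd.read_csv('test_csv.csv', skipinitialspace=True)

# Replace missing values with a dummy integer so each column can be cast
# back to int64. The sentinel has to be tracked by hand, and it defeats
# pandas' built-in null handling: isnull() will no longer flag these rows.
for col in ['FirstNumber', 'code']:
    df_test[col] = df_test[col].fillna(-777).astype('int64')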
