1

I am trying to read a csv file, of which a sample:

datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:01:00,2,0,0,2.407,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:02:50,2,0,0,2.207,-17.354,0,8192
2012-10-27 15:02:51,1,0,0,2.207,-17.358,0,-8152
2012-10-27 15:02:52,1,0,0,2.207,-17.358,0,648
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-10-27 15:06:04,0,51.195182,4.44427,2.344,-17.289,0,587
2012-12-05 09:21:34,,,,,42.960,1,16430
2012-12-05 09:21:35,,,,,42.962,1,3597

The problem I encounter is that in columns with only ints, the 0's are converted to NaN (eg columns 'check' and 'status', these are columns with only ints, but the column is read as floats because there are real missing values). But I only want the empty values to be converted to NaN, and not the zeros.

This is what I get:

>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358     NaN -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354     NaN   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358     NaN  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358     NaN    648
2012-10-27 15:06:03    NaN  51.195076  4.444407     2.349 -17.289     NaN   4909
2012-10-27 15:06:04    NaN  51.195182  4.444270     2.344 -17.289     NaN    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

So, in the columns 'check' and 'status', there are to many NaN's. In the 'lat' and 'lon' columns the 0's are not converted to NaN's.

  • Using na_values='' and keep_default_na=False does not help. Is there a way to specify to not convert int 0's to NaN? Or is this a bug?

  • I could specify the dtype of the specific columns as int with the dtype keyword. This keeps the 0's as 0's, but the problem is that those columns also contain real NaN's (empty values). So, in this case, these values are also converted to 0's as in an int column you cannot have NaN's. For this reason, I have to keep all columns as floats.


EDIT: after upgrading to pandas 0.10.1, it works as expected even without specifying keep_default_na and na_values:

>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597
joris
  • 133,120
  • 36
  • 247
  • 202
  • Does this answer your question? [Get pandas.read\_csv to read empty values as empty string instead of nan](https://stackoverflow.com/questions/10867028/get-pandas-read-csv-to-read-empty-values-as-empty-string-instead-of-nan) – dank8 Mar 03 '23 at 02:48

1 Answers1

5

You have to first set keep_default_na to False:

df = pd.read_clipboard(sep=',', index_col=0, keep_default_na=False, na_values='')

In [2]: df
Out[2]: 
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

From the doc-string of read_tables:

keep_default_na : bool, default True
     If na_values are specified and keep_default_na is False the default NaN
    values are overridden, otherwise they're appended to

na_values : list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535