I recently switched from R to Python, and now I'm trying to read data from a CSV file. I was very annoyed to find all my integer columns being treated as floats, and after some digging I found that this is the cause: NumPy or Pandas: Keeping array type as integer while having a NaN value

The accepted answer there gives me a hint as to where to go, but the problem is that my data has hundreds of columns, which I suppose is typical in data science, so I don't want to specify a type for every column when reading the data with `read_csv`. R handles this automatically.

Is it really this hard to use pandas to read in data in a proper way in Python?

Source: https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

Helen
  • @sammywemmy: how do you suggest I go about sharing sample data? Do you want me to share a csv file? – Helen Apr 06 '21 at 09:46
  • @Rafaelars: As I'm saying in my question, I do not want to explicitly state the types of all columns. And no, all columns are not integers. – Helen Apr 06 '21 at 09:47
  • @sammywemmy I mean, it is well known that `R` is able to do this, and that Python is not, so why must I show this? Read this for context: https://pandas.pydata.org/pandas-docs/version/0.24/development/extending.html#extending-extension-types – Helen Apr 06 '21 at 09:50
  • and/or see: https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support – Helen Apr 06 '21 at 09:54

1 Answer

You can try using:

import pandas as pd
df = pd.read_csv('./file.csv', dtype='Int64')

Edit: That doesn't work when the file contains string columns. Instead, try something like this:

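for col in df.columns[df.isna().any()]:
    # Cast only float columns whose non-missing values are all whole numbers;
    # a value such as 1.5 makes astype('Int64') raise a TypeError.
    if df[col].dtype == 'float' and (df[col].dropna() % 1 == 0).all():
        df[col] = df[col].astype('Int64')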

This loops through each column that contains an NA value, checks whether its dtype is float, and if so converts it to the nullable Int64 dtype.
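To make the idea easier to try out, here is a minimal, self-contained sketch of the same loop on a small in-memory CSV. The sample data, the column names, and the helper name `floats_to_nullable_ints` are made up for illustration. The extra whole-number check is my own addition: it skips columns that hold real floats, which `astype('Int64')` would otherwise refuse to cast.

```python
import io

import pandas as pd

# Hypothetical sample data: "count" is integer-valued with a gap,
# "price" holds real floats, "name" holds strings.
CSV_TEXT = """name,count,price
a,1,1.5
b,,2.0
c,3,
"""


def floats_to_nullable_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where whole-number float columns use the Int64 dtype."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if not pd.api.types.is_float_dtype(series):
            continue  # leave strings, ints, etc. untouched
        present = series.dropna()
        # Only cast when every non-missing value is a whole number;
        # casting a value like 1.5 to Int64 raises a TypeError.
        if (present % 1 == 0).all():
            out[col] = series.astype("Int64")
    return out


df = floats_to_nullable_ints(pd.read_csv(io.StringIO(CSV_TEXT)))
print(df.dtypes)
```

Here `count` should come out as `Int64` with one missing value, while `price` stays `float64` and `name` is left alone.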

dlever
  • that will not work when the data has string types – Helen Apr 06 '21 at 10:07
  • Thank you for the contribution! That will work, I'm just seeking something that will fix this automatically for me. I'm surprised Python can't handle this in a nice way. – Helen Apr 06 '21 at 10:32
  • Happy to help! I don't think it's a problem with Python per se but more so a quirk of the Pandas library. You could possibly create your own `read_csv` function incorporating the above code, possibly as a module so that you can easily import it in the future? – dlever Apr 06 '21 at 11:01
  • I tried this and got `TypeError: cannot safely cast non-equivalent float64 to int64`, although I used `Int64` as in the code above. My Pandas version is 0.24.2. – jung rhew Apr 14 '21 at 19:02
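Following up on the wrapper idea from the comments: below is a rough sketch of what such a reusable reader could look like. It is not the code from the answer. It swaps in `DataFrame.convert_dtypes`, which exists in pandas 1.0 and later (the `convert_floating` argument needs 1.2 or later), so it will not run on 0.24.x. The helper name `read_csv_nullable` and the sample data are hypothetical. As for the `TypeError` above, `astype('Int64')` typically raises it when a float column holds a fractional value such as 1.5; whether that was the cause in this case is a guess, since the data isn't shown.

```python
import io

import pandas as pd


def read_csv_nullable(source, **kwargs) -> pd.DataFrame:
    """Hypothetical helper: read a CSV, then let pandas pick nullable integer dtypes."""
    df = pd.read_csv(source, **kwargs)
    # Restrict the conversion to integers so that strings and real floats
    # keep their ordinary dtypes. Float columns whose values are all whole
    # numbers (or missing) become Int64; columns with fractions are left alone.
    return df.convert_dtypes(
        convert_string=False,
        convert_boolean=False,
        convert_floating=False,
    )


# Made-up sample: "count" has a missing value, "price" has real floats.
sample = io.StringIO("name,count,price\na,1,1.5\nb,,2.0\nc,3,\n")
df = read_csv_nullable(sample)
print(df.dtypes)
```

Putting the function in a small module of your own would let you import it wherever you would otherwise call `pd.read_csv` directly.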