
I used the following code to read a CSV, specifying the dtype for each column:

clean_pdf_type = pd.read_csv('table_updated.csv', usecols=col_names, dtype=col_types)

But it raises this error:

ValueError: Integer column has NA values in column 298 

I'm not sure how to skip or handle the NA values here.
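For reference, a minimal sketch that should reproduce this kind of error (the inline data and column names are made up for illustration):

import io
import pandas as pd

# one empty cell in column 'b', which we try to force to int on read
csv_data = io.StringIO("a,b\n1,2\n3,\n")
pd.read_csv(csv_data, dtype={'a': int, 'b': int})
# ValueError: Integer column has NA values in column 1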

  • Just import without specifying types for the columns with null values. It will parse into the datatype you want, and if it doesn't, you can always convert it and rewrite the CSV so that there won't be any problem in future reads. – Andre Motta Aug 24 '18 at 10:18
  • @AndreMotta thanks, could you give an example? – Aug 24 '18 at 10:20
  • Check the answer by Alexis... am on my phone and didn't want to make syntax mistakes here. – Andre Motta Aug 24 '18 at 10:29

2 Answers


Pandas v0.24+

See NumPy or Pandas: Keeping array type as integer while having a NaN value
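In short, pandas 0.24 introduced a nullable integer dtype that can hold missing values. A minimal sketch, assuming pandas >= 0.24 (the column name is illustrative):

import pandas as pd

# the nullable integer dtype (note the capital 'I') tolerates missing values
s = pd.Series([1, 2, None], dtype='Int64')
print(s.dtype)  # Int64

# a float column read from CSV can be converted afterwards, e.g.:
# df['col298'] = df['col298'].astype('Int64')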

Pandas pre-v0.24

You cannot have NaN values in an int dtype series. This is unavoidable, because NaN is itself a float:

import numpy as np
type(np.nan)  # float

Your best bet is to read in these columns as float instead. If you can then replace the NaN values with a filler value such as 0 or -1, you can process accordingly and convert to int:

int_cols = ['col1', 'col2', 'col3']
df[int_cols] = df[int_cols].fillna(-1)  # replace NaN with a sentinel value first
df[int_cols] = df[int_cols].apply(pd.to_numeric, downcast='integer')  # now safe to convert to int

The alternative, keeping mixed int and float values in one series, results in a series of object dtype. This is not recommended.
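To illustrate that point, a small sketch with made-up values:

import numpy as np
import pandas as pd

# forcing object dtype keeps Python ints and the float NaN side by side,
# but you lose vectorised numeric operations
s = pd.Series([1, 2, np.nan], dtype=object)
print(s.dtype)               # object
print([type(x) for x in s])  # [<class 'int'>, <class 'int'>, <class 'float'>]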

jpp
clean_pdf_type = pd.read_csv('table_updated.csv', usecols=col_names)
clean_pdf_type = clean_pdf_type.fillna(0).astype(col_types)

As said in the comments, don't specify the dtype on read; fill the NA values first and then cast to the desired types.
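For example, a sketch of how this might look end to end (the column names and types below are placeholders, not taken from the question):

import pandas as pd

# placeholder names/types -- adapt to your actual columns
col_names = ['col1', 'col2', 'col298']
col_types = {'col1': int, 'col2': float, 'col298': int}

clean_pdf_type = pd.read_csv('table_updated.csv', usecols=col_names)
clean_pdf_type = clean_pdf_type.fillna(0).astype(col_types)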

Frayal
  • Be careful. This will apply to the entire dataframe. It's likely there may be more than one type defined in `col_types`. – jpp Aug 24 '18 at 10:31
  • I know, but without any more information, filling with 0 is the easiest and fastest way to find specific errors. We could fill NaN according to the type, but for now we only have an integer error. If anything else raises a flag, then sure, splitting the dataframe so that we don't cast everything the same way would be useful. – Frayal Aug 24 '18 at 10:37
  • An easier and faster way is to understand your data and change `col_types` accordingly, reading as `float` instead of `int` where applicable (see the sketch below these comments). This solution basically says "let's look for an error and go back and make some changes." – jpp Aug 24 '18 at 10:38
  • That's exactly what it says: look for an error, make changes, and learn from your mistakes. You couldn't have phrased it more perfectly! – Frayal Aug 24 '18 at 12:35
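Following the suggestion in the comment above, a hedged sketch of reading the NA-prone columns as float directly (the column names here are placeholders):

import pandas as pd

# hypothetical column names -- mark the column with missing values as float
col_types = {'col1': int, 'col2': int, 'col298': float}
clean_pdf_type = pd.read_csv('table_updated.csv', usecols=list(col_types), dtype=col_types)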