12

I get value errors when trying to read in the csv file to my datatype. I need to be sure that it works and that every line is read in and is correct.

Errors are for example:

Pandas: ValueError: Integer column has NA values in column 2

I am trying to cast to integer in Pandas Python library, but there is a value.

However, the csv file that I read in seems to have some erroneous entries, as it consists of manually entered test results.

I read in using this command:

test = pd.read_csv("test.csv", sep=";", names=pandasframe_names, dtype=pandasframe_datatypes, skiprows=1)

names is A, B, C, D and E and is defined correctly.

If there is an erroneous entry, I need a way of handling this without losing the full row.

So here is my case: I have a pandas dataframe that reads in a csv table with 5 columns with the headers A, B, C, D, E. I skip row one with the parameter skiprows=1

pandas_datatypes={'A': pd.np.int64, 'B':pd.np.int64, 'C':pd.np.float64, 'D':object, 'E':object}

My row has 5 column and the first 2 are int64 and the 3rd is float64 and the next 2 are object (e.g. string)

Those are equivalent to my dtype when I read it in. Meaning dtype=pandas_datatypes

Now I have entries like so:

entry 1: 5; 5; 2.2; pedagogy; teacher (correct)
entry 2: 8; 7.0; 2.2; pedagogy; teacher (incorrect, as second is float instead of int)
entry 3: NA; 5; 2.2; pedagogy; teacher (incorrect, as first value has entered NA as is missing)
entry 4: none; 5; 2.2; pedagogy; teacher (incorrect, as first value has entered none as is missing)
entry 5: 8; 5; 2; pedagogy; teacher (incorrect, as third is int instead of float)

How do I best handle this and what do I have to add to make this work for sure? In case that there is one incorrect entry, I don't want to lose the full line. Should I enter NULL? But then I would need to flag this for someone to manually look at it.

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
Smiley
  • 479
  • 2
  • 5
  • 15
  • Does this answer your question? [Convert Pandas column containing NaNs to dtype \`int\`](https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int) – Ludo Jul 13 '22 at 14:11

2 Answers2

18

Pandas now has extension types, for which integer support NA values. You will get pd.NA in those fields.

https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes

Use Pandas Int64 type, you'll be fine!

pandas_datatypes={'A': 'Int64', 'B': 'Int64', 'C':pd.np.float64, 'D':object, 'E':object}

Just tested it with pandas 1.3.5, works like a charm.

Etienne
  • 901
  • 6
  • 15
5

Since you have incomplete/corrupt data (very common!), you can't enforce dtypes from the outset. You have to first import it as is:

Given file1.csv:

5; 5; 2.2; pedagogy; teacher
8; 7.0; 2.2; pedagogy; teacher
NA; 5; 2.2; pedagogy; teacher
none; 5; 2.2; pedagogy; teacher
8; 5; 2; pedagogy; teacher

We can read it as

df = pd.read_csv('file1.csv', sep=';', header=None, names=['A', 'B', 'C', 'D', 'E'])

We then coerce A, B and C to numeric, forcing NaN wherever there is text.

for col in ['A', 'B', 'C']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

You then have the C as a float col. Making A and B into ints is a bit trickier, since they have NaN values (a known Pandas issue).

If you have 0.24 or higher, you can do:

df['A'] = df['A'].astype(pd.Int64Dtype())
df['B'] = df['B'].astype(pd.Int64Dtype())

Otherwise you can either fill the NaNs in some way, or live with the floats.

Josh Friedlander
  • 10,870
  • 5
  • 35
  • 75
  • How do I deal with forcing the number types into strings? – Smiley Mar 03 '20 at 11:31
  • my datamodel has type varchar in sqllite, so maybe string would be wrong. what is the best way to force this on import to varchar? – Smiley Mar 03 '20 at 13:38
  • columns D and E have type varchar in my datamodel. I use sqlalchemy ORM. When I import, the data model I get in SQLLite TEXT. But I need varchar. Also both the pd.Int64Dtype() and pd.Int32Dtype() yield BIGINT in SQLLite, but I need an INTEGER datatype on import. – Smiley Mar 03 '20 at 14:53
  • D and E are object type, which should be VARCHAR in SQL. As for the other issue, I don't know if it can be helped, maybe you can convert it in SQL after you add it to the DB? – Josh Friedlander Mar 03 '20 at 15:12
  • D and E become TEXT in SQL not varchar, any suggestion on how to convert this? – Smiley Mar 04 '20 at 08:50
  • See [here](https://www.sqlservercentral.com/forums/topic/converting-text-columns-to-varcharmax) but I think best to make this a new question – Josh Friedlander Mar 04 '20 at 11:36