I get ValueErrors when reading a CSV file into a pandas DataFrame with explicit column dtypes. I need to be sure that every line is read in and that the values are correct.
One example of the errors:
Pandas: ValueError: Integer column has NA values in column 2
I am trying to cast a column to integer in pandas, but the column contains a missing (NA) value.
However, the csv file that I read in seems to have some erroneous entries, as it consists of manually entered test results.
I read in using this command:
test = pd.read_csv("test.csv", sep=";", names=pandasframe_names, dtype=pandas_datatypes, skiprows=1)
pandasframe_names contains the column names A, B, C, D and E and is defined correctly.
If there is an erroneous entry, I need a way of handling this without losing the full row.
So here is my case: I read a CSV table with 5 columns into a pandas DataFrame, using the headers A, B, C, D, E. I skip the first row of the file with the parameter skiprows=1. The dtypes are defined like this:
pandas_datatypes = {'A': 'int64', 'B': 'int64', 'C': 'float64', 'D': object, 'E': object}
Each row has 5 columns: the first two are int64, the third is float64, and the last two are object (e.g. strings).
These match the dtypes I pass when reading the file, i.e. dtype=pandas_datatypes.
Now I have entries like so:
entry 1: 5; 5; 2.2; pedagogy; teacher (correct)
entry 2: 8; 7.0; 2.2; pedagogy; teacher (incorrect, as the second value is a float instead of an int)
entry 3: NA; 5; 2.2; pedagogy; teacher (incorrect, as the first value is missing and was entered as NA)
entry 4: none; 5; 2.2; pedagogy; teacher (incorrect, as the first value is missing and was entered as none)
entry 5: 8; 5; 2; pedagogy; teacher (incorrect, as the third value is an int instead of a float)
How do I best handle this, and what do I have to add to make the read reliable? If a row contains one incorrect entry, I don't want to lose the full row. Should I enter NULL for the bad value? But then I would need to flag the row so that someone can look at it manually.
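To make the question concrete, here is a minimal sketch of the kind of handling I have in mind, not a confirmed solution. It assumes that reading everything as text first and converting afterwards is acceptable, that a value like 7.0 in an integer column may be treated as 7, and that 2 in a float column may be treated as 2.0. The in-memory sample data and the needs_review column name are made up for illustration.

```python
import io
import pandas as pd

# Sample data mirroring the five entries above (header row included).
raw = """A;B;C;D;E
5; 5; 2.2; pedagogy; teacher
8; 7.0; 2.2; pedagogy; teacher
NA; 5; 2.2; pedagogy; teacher
none; 5; 2.2; pedagogy; teacher
8; 5; 2; pedagogy; teacher
"""

names = ["A", "B", "C", "D", "E"]
int_cols = ["A", "B"]
float_cols = ["C"]

# Read every column as text so that no row is rejected at parse time.
# keep_default_na=False keeps "NA" as the literal string for later review.
df = pd.read_csv(io.StringIO(raw), sep=";", names=names, skiprows=1,
                 dtype=str, keep_default_na=False, skipinitialspace=True)

# Convert the numeric columns; unparseable cells become NaN instead of raising.
numeric = df[int_cols + float_cols].apply(pd.to_numeric, errors="coerce")

# In integer columns, a parsed value with a fractional part (e.g. 7.5) is suspect.
fractional = numeric[int_cols].notna() & (numeric[int_cols] % 1 != 0)

# Flag a row if any numeric cell failed to parse or an integer cell is fractional.
needs_review = numeric.isna().any(axis=1) | fractional.any(axis=1)

clean = df.copy()
for col in int_cols:
    # The nullable "Int64" dtype can hold missing values, unlike plain int64.
    clean[col] = numeric[col].where(~fractional[col]).astype("Int64")
for col in float_cols:
    clean[col] = numeric[col].astype("float64")
clean["needs_review"] = needs_review  # hypothetical flag column

print(clean)
```

With this approach all five rows are kept, the bad cells become missing values, and the flagged rows can be handed over for manual checking with clean[clean["needs_review"]]. The original text of each cell is still available in df.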