
When using `read_csv` with pandas and asking for a given column to be converted to a type, a single malformed value interrupts the whole operation, without any indication of which value is at fault.

For example, running something like:

import pandas as pd
import numpy as np


df = pd.read_csv('my.csv', dtype={ 'my_column': np.int64 })

will lead to a stack trace ending with the error:

ValueError: cannot safely convert passed user dtype of <i8 for object dtyped data in column ...

If the error message included the row number or the offending value, I could add it to the list of known NaN values, but as it stands there is nothing I can do.

Is there a way to tell the parser to ignore such failures and return `np.nan` in that case?

Post scriptum: funnily enough, after parsing without any type suggestion (no `dtype` argument), `df['my_column'].value_counts()` seems to infer the type correctly and put `np.nan` in automatically, even though the actual dtype of the series is a generic `object`, which fails on almost every plotting and statistical operation.
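
A minimal sketch of that behaviour (the in-memory file and the 'oops' token here are invented for illustration):

import io

import pandas as pd

# hypothetical stand-in for my.csv; 'oops' plays the malformed value
csv_data = io.StringIO("my_column\n1\n2\noops\n4\n")

df = pd.read_csv(csv_data)    # no dtype argument, so no error is raised...
print(df['my_column'].dtype)  # ...but the column comes back as generic object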

danza
  • You're explicitly specifying the dtype, so pandas expects all values in that column to conform to that type. If you know which values to treat as `NaN`, you can pass them to `na_values`. It looks like it's better to just let `read_csv` guess, but this will mean the dtype is `float64`, as `NaN` cannot be represented in `int64`. – EdChum May 12 '15 at 12:13
  • Does the `value_counts` output indicate some value that would not be convertible to a number (which would explain the `object` dtype)? – joris May 12 '15 at 12:13
  • Thanks, your comments led me to the solution, which I posted as an answer. Feel free to post alternative answers if you want to add something. – danza May 12 '15 at 14:09

1 Answer


Thanks to the comments I realised that there is no NaN representation for integer dtypes, which was very surprising to me. I therefore switched to converting to float:

import pandas as pd
import numpy as np


df = pd.read_csv('my.csv', dtype={ 'my_column': np.float64 })
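
The limitation itself is easy to demonstrate (a quick standalone illustration): NumPy integer dtypes have no bit pattern left over for NaN, so pandas upcasts to float64 as soon as one appears:

import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)       # int64: plain integers fit
print(pd.Series([1, 2, np.nan]).dtype)  # float64: NaN forces the upcast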

The float conversion gave me an understandable error message containing the value that failed to convert, so I could add that value to `na_values`:

df = pd.read_csv('my.csv', dtype={ 'my_column': np.float64 }, na_values=['n/a'])
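
The whole round trip can be reproduced with an in-memory file (a sketch with made-up data; 'oops' again stands in for the real offending token):

import io

import numpy as np
import pandas as pd

bad_csv = "my_column\n1\n2\noops\n4\n"

# with a float dtype the parser names the value it chokes on
try:
    pd.read_csv(io.StringIO(bad_csv), dtype={'my_column': np.float64})
except ValueError as exc:
    print(exc)  # e.g. "could not convert string to float: 'oops'"

# declaring that token as NA lets the conversion succeed
df = pd.read_csv(io.StringIO(bad_csv),
                 dtype={'my_column': np.float64},
                 na_values=['oops'])
print(df['my_column'].dtype)  # float64, with NaN where 'oops' was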

With the right `na_values` in place I could finally import the CSV in a way that works with visualisation and statistical functions:

>>> df['my_column'].dtype
dtype('float64')

Once you are able to spot the right `na_values`, you can remove the `dtype` argument from `read_csv`; type inference will now happen correctly:

df = pd.read_csv('my.csv', na_values=['n/a'])
danza
  • Integer values are normally stored in [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement) format, which uses up all available bit patterns for actual numbers, leaving none to signify NaN. By contrast, most floating-point implementations follow the [IEEE format](https://en.wikipedia.org/wiki/IEEE_floating_point), which sets aside bit patterns for infinity and NaN. Note that [floating-point values are not precise](http://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate). – Kevin May 12 '15 at 15:13
  • Note that the `dtype={ 'my_column': np.float64 }` is not needed; this should be the result automatically (once you set the correct `na_values`). – joris May 12 '15 at 22:27
  • @joris You are right, thanks. It was needed in order to discover the value for `na_values`, though. I will edit my answer. – danza May 13 '15 at 07:44
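
For readers on newer pandas versions: `pd.to_numeric` (added in pandas 0.17, after this answer was written) accepts `errors='coerce'`, which does exactly what the question asked for, turning every value that fails to parse into `NaN`. A minimal sketch:

import pandas as pd

# read the column as-is; it arrives as object if values are mixed
df = pd.read_csv('my.csv')

# coerce: anything that cannot be parsed as a number becomes NaN
df['my_column'] = pd.to_numeric(df['my_column'], errors='coerce')
print(df['my_column'].dtype)  # float64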