
When using `read_csv` with pandas and asking for a given column to be converted to a type, a single malformed value interrupts the whole operation, without any indication of which value is at fault.

For example, running something like:

import pandas as pd
import numpy as np


df = pd.read_csv('my.csv', dtype={ 'my_column': np.int64 })

will lead to a stack trace ending with the error:

ValueError: cannot safely convert passed user dtype of <i8 for object dtyped data in column ...

If the error message included the row number or the offending value, I could add it to the list of known NaN values, but as it stands there is nothing I can do.

Is there a way to tell the parser to ignore such failures and return `np.nan` in that case?

Post scriptum: funnily enough, after parsing without any type suggestion (no `dtype` argument), `df['my_column'].value_counts()` seems to infer the type correctly and put `np.nan` in automatically, even though the actual dtype of the series is a generic `object`, which fails on almost every plotting and statistical operation.
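
A minimal sketch of that behaviour (the in-memory file and the 'oops' token here are invented for illustration):

import io

import pandas as pd

# hypothetical stand-in for my.csv; 'oops' plays the malformed value
csv_data = io.StringIO("my_column\n1\n2\noops\n4\n")

df = pd.read_csv(csv_data)    # no dtype argument, so no error is raised...
print(df['my_column'].dtype)  # ...but the column comes back as generic object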

danza
  • You're explicitly specifying the dtype, so pandas expects all values in that column to conform to that type. If you know which values to treat as `NaN`, you can pass them to `na_values`. It looks like it's better to just let `read_csv` guess, but this will mean the dtype is `float64`, as `NaN` cannot be represented in `int64`. – EdChum May 12 '15 at 12:13
  • Does the `value_counts` output indicate some value that would not be convertible to a number (which would explain the `object` dtype)? – joris May 12 '15 at 12:13
  • Thanks, your comments led me to the solution, which I posted as an answer. Feel free to post alternative answers if you want to add something. – danza May 12 '15 at 14:09

1 Answer


Thanks to the comments I realised that there is no NaN representation for integer dtypes, which was very surprising to me. I therefore switched to converting to float:

import pandas as pd
import numpy as np


df = pd.read_csv('my.csv', dtype={ 'my_column': np.float64 })
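
The limitation itself is easy to demonstrate (a quick standalone illustration): NumPy integer dtypes have no bit pattern left over for NaN, so pandas upcasts to float64 as soon as one appears:

import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)       # int64: plain integers fit
print(pd.Series([1, 2, np.nan]).dtype)  # float64: NaN forces the upcast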

The float conversion gave me an understandable error message containing the value that failed to convert, so I could add that value to `na_values`:

df = pd.read_csv('my.csv', dtype={ 'my_column': np.float64 }, na_values=['n/a'])
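
The whole round trip can be reproduced with an in-memory file (a sketch with made-up data; 'oops' again stands in for the real offending token):

import io

import numpy as np
import pandas as pd

bad_csv = "my_column\n1\n2\noops\n4\n"

# with a float dtype the parser names the value it chokes on
try:
    pd.read_csv(io.StringIO(bad_csv), dtype={'my_column': np.float64})
except ValueError as exc:
    print(exc)  # e.g. "could not convert string to float: 'oops'"

# declaring that token as NA lets the conversion succeed
df = pd.read_csv(io.StringIO(bad_csv),
                 dtype={'my_column': np.float64},
                 na_values=['oops'])
print(df['my_column'].dtype)  # float64, with NaN where 'oops' was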

With the right `na_values` in place I could finally import the CSV in a way that works with visualisation and statistical functions:

>>> df['my_column'].dtype
dtype('float64')

Once you are able to spot the right `na_values`, you can remove the `dtype` argument from `read_csv`; type inference will now happen correctly:

df = pd.read_csv('my.csv', na_values=['n/a'])
danza
  • Integer values are normally stored in [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement) format, which uses up all available bit patterns for actual numbers, leaving none to signify NaN. By contrast, most floating-point implementations follow the [IEEE format](https://en.wikipedia.org/wiki/IEEE_floating_point), which sets aside bit patterns for infinity and NaN. Note that [floating-point values are not precise](http://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate). – Kevin May 12 '15 at 15:13
  • Note that the `dtype={ 'my_column': np.float64 }` is not needed; this should be the result automatically (once you set the correct `na_values`). – joris May 12 '15 at 22:27
  • @joris You are right, thanks. It was needed in order to discover the value for `na_values`, though. I will edit my answer. – danza May 13 '15 at 07:44
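
For readers on newer pandas versions: `pd.to_numeric` (added in pandas 0.17, after this answer was written) accepts `errors='coerce'`, which does exactly what the question asked for, turning every value that fails to parse into `NaN`. A minimal sketch:

import pandas as pd

# read the column as-is; it arrives as object if values are mixed
df = pd.read_csv('my.csv')

# coerce: anything that cannot be parsed as a number becomes NaN
df['my_column'] = pd.to_numeric(df['my_column'], errors='coerce')
print(df['my_column'].dtype)  # float64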