
If I use the Pandas read_csv() function, elements of short rows are mapped to NaN by default. I would like to suppress that mapping while still interpreting NA as NaN. I'm mostly interested in file truncation caused by transmission problems, but short rows in the middle of the file should get the same "missing" value. I tried na_filter=False and keep_default_na=False; while each seemed to map empty cells to the empty string, neither mapped the string NA to NaN.

Is there a way to have my cake (NA => NaN) and eat it too (missing values not mapped to NaN)? (I'm using Pandas 0.22.0 w/ Python 3.6.)

Example:

col1,col2,col3,col4
1,2,NA,4
4,5
12

Assume the file has been truncated, so the characters "12" are the last in the file (no trailing newline). With na_filter and keep_default_na at their default values of True, the resulting values are

1,2,NaN,4
4,5,NaN,NaN
12,NaN,NaN,NaN

If I set either to False, I get

1,2,NA,4
4,5,,
12,,,

I would like to find some way to get a NaN out of the third column of the first row without also mapping the missing values to NaN.
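
Here is a minimal reproduction of what I'm seeing, with the sample above read from an in-memory buffer instead of a file:

import io
import pandas as pd

data = "col1,col2,col3,col4\n1,2,NA,4\n4,5\n12"

# default settings: NA and the short-row gaps all come back as NaN
print(pd.read_csv(io.StringIO(data)))

# with keep_default_na=False (na_filter=False behaves the same for me),
# NA stays the literal string "NA" and the gaps come back as empty strings
print(pd.read_csv(io.StringIO(data), keep_default_na=False))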

smontanaro
    You need to give more details, e.g., example data, expected output, attempts at a solution **not** in paragraph form. – Rushabh Mehta Jul 31 '18 at 13:14

1 Answer


By default, Pandas will interpret the following values as NaN (from the docs):

The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].

In order to avoid NA being parsed as a null value, you need to set keep_default_na=False and specify the na_values directly.

To address your second problem, in order to avoid Pandas throwing errors when reading in a file with a variable number of columns, you need to specify the column names. Putting it all together:

import pandas as pd

# new null values, removing NA from the default list
new_na_values = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', '']

# read in the file; header=0 skips the file's own header row,
# which would otherwise be parsed as data once names is supplied
df = pd.read_csv("<path to file>.csv", keep_default_na=False, na_values=new_na_values, header=0, names=["col1", "col2", "col3", "col4"])
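
For the asker's stated goal (NA => NaN while leaving everything else untouched), the same mechanism should work with a much shorter list, keeping only the sentinels you want. A sketch, again reading the sample data from an in-memory buffer:

import io
import pandas as pd

data = "col1,col2,col3,col4\n1,2,NA,4\n4,5\n12"

# only the literal string NA maps to NaN; because '' is not listed,
# empty cells survive as empty strings
df = pd.read_csv(io.StringIO(data), keep_default_na=False, na_values=["NA"])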

EDIT: It looks like the most recent version of Pandas can now read in files with a variable number of columns without explicitly setting the column names (as suggested here, for example). So, depending on your version of Pandas, you may not need the names argument.
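
On such a version, the call above should reduce to something like this (a sketch, reusing new_na_values from the snippet above):

# recent pandas infers the four columns from the header row,
# so the explicit header/names arguments can be dropped
df = pd.read_csv("<path to file>.csv", keep_default_na=False, na_values=new_na_values)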

sundance
  • Thanks. Setting na_values did what I needed. Wasn't aware I had a second problem. Pandas hasn't raised any exceptions. – smontanaro Jul 31 '18 at 13:59
  • Yep, just updated my `pandas` version and now it reads files with a variable number of columns fine; glad they fixed this. Updated the answer – sundance Jul 31 '18 at 14:21