
If I use the Pandas read_csv() function, elements of short rows are mapped to NaN by default. I would like to suppress that mapping while still interpreting NA as NaN. I'm mostly interested in file truncation caused by transmission problems, but short rows in the middle of the file should get the same "missing" value. I tried na_filter=False and keep_default_na=False; while each seemed to map empty cells to the empty string, neither mapped the string NA to NaN.

Is there a way to have my cake (NA => NaN) and eat it too (missing values not mapped to NaN)? (I'm using Pandas 0.22.0 w/ Python 3.6.)

Example:

col1,col2,col3,col4
1,2,NA,4
4,5
12

Assume the file has been truncated, so the characters "12" are the last in the file (no trailing newline). With na_filter and keep_default_na at their default values of True, the resulting values are

1,2,NaN,4
4,5,NaN,NaN
12,NaN,NaN,NaN

If I set either to False, I get

1,2,NA,4
4,5,,
12,,,

I would like to find some way to get a NaN out of the third column of the first row without also mapping the missing values to NaN.
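
Here is a minimal reproduction of what I'm seeing, with the sample above read from an in-memory buffer instead of a file:

import io
import pandas as pd

data = "col1,col2,col3,col4\n1,2,NA,4\n4,5\n12"

# default settings: NA and the short-row gaps all come back as NaN
print(pd.read_csv(io.StringIO(data)))

# with keep_default_na=False (na_filter=False behaves the same for me),
# NA stays the literal string "NA" and the gaps come back as empty strings
print(pd.read_csv(io.StringIO(data), keep_default_na=False))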

smontanaro
    You need to give more details, e.g., example data, expected output, attempts at a solution **not** in paragraph form. – Rushabh Mehta Jul 31 '18 at 13:14

1 Answer


By default, Pandas will interpret the following values as NaN (from the docs):

The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].

In order to avoid NA being parsed as a null value, you need to set keep_default_na=False and specify the na_values directly.

To address your second problem, in order to avoid Pandas throwing errors when reading in a file with a variable number of columns, you need to specify the column names. Putting it all together:

import pandas as pd

# new null values, removing NA from the default list
new_na_values = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', '']

# read in the file; header=0 skips the file's own header row,
# which would otherwise be parsed as data once names is supplied
df = pd.read_csv("<path to file>.csv", keep_default_na=False, na_values=new_na_values, header=0, names=["col1", "col2", "col3", "col4"])
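
For the asker's stated goal (NA => NaN while leaving everything else untouched), the same mechanism should work with a much shorter list, keeping only the sentinels you want. A sketch, again reading the sample data from an in-memory buffer:

import io
import pandas as pd

data = "col1,col2,col3,col4\n1,2,NA,4\n4,5\n12"

# only the literal string NA maps to NaN; because '' is not listed,
# empty cells survive as empty strings
df = pd.read_csv(io.StringIO(data), keep_default_na=False, na_values=["NA"])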

EDIT: It looks like the most recent version of Pandas can now read in files with a variable number of columns without explicitly setting the column names (as suggested here, for example). So, depending on your version of Pandas, you may not need the names argument.
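
On such a version, the call above should reduce to something like this (a sketch, reusing new_na_values from the snippet above):

# recent pandas infers the four columns from the header row,
# so the explicit header/names arguments can be dropped
df = pd.read_csv("<path to file>.csv", keep_default_na=False, na_values=new_na_values)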

sundance
  • Thanks. Setting na_values did what I needed. Wasn't aware I had a second problem. Pandas hasn't raised any exceptions. – smontanaro Jul 31 '18 at 13:59
  • Yep, just updated my `pandas` version and now it reads files with a variable number of columns fine; glad they fixed this. Updated the answer – sundance Jul 31 '18 at 14:21