168

I'm using the pandas library to read in some CSV data. In my data, certain columns contain strings. The string "nan" is a possible value, as is an empty string. I managed to get pandas to read "nan" as a string, but I can't figure out how to get it not to read an empty value as NaN. Here's sample data and output:

One,Two,Three
a,1,one
b,2,two
,3,three
d,4,nan
e,5,five
nan,6,
g,7,seven

>>> pandas.read_csv('test.csv', na_values={'One': [], "Three": []})
    One  Two  Three
0    a    1    one
1    b    2    two
2  NaN    3  three
3    d    4    nan
4    e    5   five
5  nan    6    NaN
6    g    7  seven

It correctly reads "nan" as the string "nan", but still reads the empty cells as NaN. I tried passing str in the converters argument to read_csv (with converters={'One': str}), but it still reads the empty cells as NaN.

I realize I can fill the values after reading, with fillna, but is there really no way to tell pandas that an empty cell in a particular CSV column should be read as an empty string instead of NaN?

piRSquared
BrenBarn
  • Note the simpler answer using the more recent option `keep_default_na` below. – nealmcb May 24 '20 at 17:08
  • `pd.read_csv( sourceObj, dtype='string' )` , no additional parameters are needed. Pandas will cast all rows string, and empty values will be set as empty string '' – dank8 Mar 03 '23 at 01:43

6 Answers

210

I was still confused after reading the other answers and comments, but the answer is now simpler, so here you go.

Since pandas version 0.9 (from 2012), you can read your CSV with empty cells interpreted as empty strings by simply setting keep_default_na=False:

pd.read_csv('test.csv', keep_default_na=False)

The background is explained more clearly in the pandas issue tracker; the fix landed on Aug 19, 2012, for pandas version 0.9.
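For reference, here is a minimal, self-contained sketch with the question's sample data (using io.StringIO in place of test.csv; the expected values are shown in the comments):

import io
import pandas as pd

csv_text = """One,Two,Three
a,1,one
b,2,two
,3,three
d,4,nan
e,5,five
nan,6,
g,7,seven
"""

# keep_default_na=False disables the built-in NA string list, so both
# empty cells and the literal "nan" come through as plain strings
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df['One'].tolist())    # ['a', 'b', '', 'd', 'e', 'nan', 'g']
print(df['Three'].tolist())  # ['one', 'two', 'three', 'nan', 'five', '', 'seven']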

nealmcb
71

I added a ticket to add an option of some sort here:

https://github.com/pydata/pandas/issues/1450

In the meantime, result.fillna('') should do what you want.

EDIT: in the development version (to be 0.8.0 final), if you specify an empty list of na_values, empty strings will stay empty strings in the result.
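A quick sketch of the interim workaround, assuming the frame has been read as in the question and the NaN cells are the ones you want back as empty strings:

import pandas as pd

# read as in the question, then backfill the missing cells with empty strings
result = pd.read_csv('test.csv', na_values={'One': [], 'Three': []})
result = result.fillna('')   # or result.fillna('', inplace=True), per the comment below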

Wes McKinney
  • [Documentation for `DataFrame.fillna`.](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) Try `result.fillna('', inplace=True)`. Otherwise it creates a copy of the dataframe. – Sergey Orshanskiy Sep 05 '14 at 22:48
  • Sorry to resurrect such an old answer, but did this ever happen? As far as I can tell from [this GitHub PR](https://github.com/pydata/pandas/pull/1522) it was closed without ever being merged, and I'm not seeing the requested behavior in pandas version 0.14.x. – drammock Sep 10 '15 at 20:52
  • [Documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for read_csv now offers both `na_values` (list or dict indexed by columns) and `keep_default_na` (bool). The `keep_default_na` value indicates whether pandas' default NA values should be replaced or appended to. The OP's code doesn't work currently just because it's missing this flag. For this example, you could use `pandas.read_csv('test.csv', na_values=['nan'], keep_default_na=False)`. – Michael Delgado Sep 30 '15 at 20:17
  • @delgadom Thanks for leading me to `keep_default_na`. But note that he doesn't want 'nan' to be treated as a default either. I've added a more complete explanation as a new answer. – nealmcb May 07 '17 at 14:55
  • Ran into this again. The fix is easy (as in the top answer: set `keep_default_na=False`), but pandas' default behaviour here is IMO bad: if read_csv infers that a column is not numeric, it should not automatically change empty strings to NaN. – pietroppeter Aug 27 '20 at 08:38
  • The answers that alter the result afterwards are very common, but they neglect the fact that you may not know the data structure upfront, e.g. software processing arbitrary CSVs. I need pandas to detect my column types as smartly as possible, because I cannot fix the types afterward. – Eric Burel Oct 20 '21 at 11:37
16

Pandas read_csv() has a simple argument for this:

Use:

df = pd.read_csv('test.csv', na_filter=False)
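Note that na_filter=False disables missing-value detection for every column: nothing will be parsed as NaN, so the empty cells and the literal "nan" both stay strings, and (per the pandas docs) it can also speed up reading large files that contain no NAs. If you still want certain markers treated as missing, keep_default_na=False with an explicit na_values list gives finer control, as the comments below discuss.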
buhtz
Sundeep
  • It looks like the OP _does_ want to use `na_values` to recognize "nan", but turning `na_filter` off entirely would defeat that. Thus my answer with `keep_default_na=False`. – nealmcb Oct 18 '19 at 14:54
  • Be careful: `na_filter=False` can change your column types to object. – Ricardo Mutti Sep 07 '21 at 21:41
  • Re `na_filter=False` "changing column to type object": it seems that pandas' default is to set a column to object dtype when its other values are strings rather than clearly numeric (e.g., columns 'One' and 'Three' in the question). – wiseass Feb 04 '23 at 13:59
11

The values that pandas treats as missing by default in read_csv() can be inspected directly:

import pandas
default_missing = pandas._libs.parsers.STR_NA_VALUES
print(default_missing)

The output

{'', '<NA>', 'nan', '1.#QNAN', 'NA', 'null', 'n/a', '-nan', '1.#IND', '#N/A N/A', 'N/A', 'NULL', 'NaN', '-1.#IND', '-1.#QNAN', '#NA', '#N/A', '-NaN'}

With that, you can opt out of specific defaults:

import pandas

default_missing = pandas._libs.parsers.STR_NA_VALUES
# set.remove() returns None (and raises KeyError for absent values),
# so build the reduced set with a set difference instead
default_missing = default_missing - {'', 'na'}

with open('test.csv', 'r') as csv_file:
    # keep_default_na=False is needed so the trimmed set replaces the
    # defaults instead of being appended to them
    pandas.read_csv(csv_file, na_values=default_missing, keep_default_na=False)
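Note that 'nan' is still in the trimmed set above, so the literal string "nan" would still be parsed as NaN with this approach; for exactly what the question asks (keep both '' and 'nan' as strings), keep_default_na=False on its own, as in the top answer, is the simpler route.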
buhtz
3

If you want to keep the empty strings in just one column, pass str as that column's converter (dtype won't work for this):

pd.read_csv('test.csv', converters={'column_name': str})
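Applied to the question's data, that would look something like this (a sketch; the converter receives the raw cell text, so empty cells and the literal "nan" both stay strings in the converted columns, while other columns keep the default NaN handling):

import pandas as pd

# convert both string columns cell by cell; str() is applied to the raw
# field text, so '' stays '' and 'nan' stays 'nan', while 'Two' is still
# parsed as integers
df = pd.read_csv('test.csv', converters={'One': str, 'Three': str})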
ronkov
1

pd.read_csv(sourceObj, dtype='string')

No additional parameters are needed.

Each column is read as a string type, and empty values become the empty string ''.

Version: Pandas v1.5

dank8