128

The pandas read_csv() function interprets 'NA' as NaN (not a number) instead of as a valid string.

In the simple case below, note that the output in row 1, column 2 (zero-based count) is nan instead of 'NA'.

sample.tsv (tab delimited)

PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 1 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118

read_sample.py

import pandas as pd

df = pd.read_csv(
    'sample.tsv',
    sep='\t',
    encoding='utf-8',
)

for df_tuples in df.itertuples(index=True):
    print(df_tuples)

output

(0, u'5d8b', u'N', u'P60490', 1, 146, 1, 146, 1, 146)
(1, u'5d8b', nan, u'P80377', 1, 126, 1, 126, 1, 126)
(2, u'5d8b', u'O', u'P60491', 1, 118, 1, 118, 1, 118)

Additional Information

Rewriting the file with quotes around the data in the 'CHAIN' column and then using the quotechar parameter quotechar='\'' has the same result. Passing a dictionary of types via the dtype parameter, dtype=dict(valid_cols), does not change the result either.
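
For reference, a minimal sketch of the dtype attempt (valid_cols here is illustrative; assume it maps each column name in sample.tsv to a type):

import pandas as pd

# Illustrative column-to-type mapping; the real valid_cols may differ.
valid_cols = [('PDB', str), ('CHAIN', str), ('SP_PRIMARY', str),
              ('RES_BEG', int), ('RES_END', int), ('PDB_BEG', int),
              ('PDB_END', int), ('SP_BEG', int), ('SP_END', int)]

df = pd.read_csv('sample.tsv', sep='\t', encoding='utf-8',
                 dtype=dict(valid_cols))
print(df.CHAIN.tolist())   # ['N', nan, 'O'] -- 'NA' is still converted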

An old answer to Prevent pandas from automatically inferring type in read_csv suggests first using a numpy record array to parse the file, but given the ability to now specify column dtypes, this shouldn't be necessary.

Note that itertuples() is used to preserve dtypes as described in the iterrows documentation: "To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns tuples of the values and which is generally faster than iterrows."
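
As a small illustration of that dtype point (a sketch, not from the original example): iterrows() converts each row to a Series, which can upcast values, while itertuples() hands them back unchanged.

import pandas as pd

tiny = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

# iterrows() yields (index, Series); the int column is upcast to float
row = next(tiny.iterrows())[1]
print(row['a'])    # 1.0

# itertuples() yields tuples; the original integer is preserved
tup = next(tiny.itertuples())
print(tup[1])      # 1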

The example was tested on Python 2 and 3 with pandas versions 0.16.2, 0.17.0, and 0.17.1.


Is there a way to capture the valid string 'NA' instead of having it converted to NaN?

Michelle Welcks
  • It's not that it's treating it as a number, but rather indicates that you have [missing data](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#values-considered-missing). I don't know what the fix is in your case, but just thought you should know. – Jeff Mercado Nov 27 '15 at 07:27
  • @JeffMercado If you try Anton's example below, but remove `keep_default_na=False`, you'll see that NaN is actually ``. – Michelle Welcks Nov 27 '15 at 08:10
  • @binarysubstrate Maybe you can consider trimming your question a bit (eg only keeping the 'simple case'), this would make the question more focused and more useful for other readers (as it is a very good question!) – joris Nov 27 '15 at 10:39
  • @joris Good suggestion. I made an edit to simplify the question and reduce the length. – Michelle Welcks Nov 27 '15 at 21:22
  • Does this answer your question? [Get pandas.read\_csv to read empty values as empty string instead of nan](https://stackoverflow.com/questions/10867028/get-pandas-read-csv-to-read-empty-values-as-empty-string-instead-of-nan) – dank8 Mar 03 '23 at 02:48

6 Answers

122

You could use the parameters keep_default_na and na_values to set all NA values by hand (see the docs):

import pandas as pd
from io import StringIO

data = """
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 _ 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
"""

df = pd.read_csv(StringIO(data), sep=' ', keep_default_na=False, na_values=['_'])

In [130]: df
Out[130]:
    PDB CHAIN SP_PRIMARY  RES_BEG  RES_END  PDB_BEG  PDB_END  SP_BEG  SP_END
0  5d8b     N     P60490        1      146        1      146       1     146
1  5d8b    NA     P80377      NaN      126        1      126       1     126
2  5d8b     O     P60491        1      118        1      118       1     118

In [144]: df.CHAIN.apply(type)
Out[144]:
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
Name: CHAIN, dtype: object

EDIT

All default NA values from the na_values docs (as of pandas 1.0.0):

The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
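
If you want to check the default set on your own install, it can be read from STR_NA_VALUES (note this lives in the private pandas._libs package, so the import location may change between versions):

from pandas._libs.parsers import STR_NA_VALUES
print(sorted(STR_NA_VALUES))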

Anton Protopopov
  • Thanks for mentioning how to tailor na_values to our requirements. I only needed to remove N/A and NA from the list; I needed to retain the rest. – nitin3685 Oct 21 '19 at 08:33
69

For me, the solution came from using the parameter na_filter=False:

df = pd.read_csv(file_, header=0, dtype=object, na_filter = False)
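
One caveat (a sketch with made-up inline data): na_filter=False switches off missing-value detection entirely, so genuinely empty fields come through as empty strings rather than NaN.

import pandas as pd
from io import StringIO

# Hypothetical data: one 'NA' string and one truly empty field
csv = "CHAIN,SP_PRIMARY\nNA,P80377\n,P60491\n"
df = pd.read_csv(StringIO(csv), na_filter=False)
print(df.CHAIN.tolist())   # ['NA', ''] -- both kept as strings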
Matthew Coelho
  • In my case, some data is missing in the last column, e.g., `100,`. The accepted answer still leaves NA there, but `na_filter=False` does the trick. – nos Jun 10 '22 at 13:52
17

Setting the keep_default_na parameter does the trick.

Here is an example of keeping NA as a string value while reading a CSV file with pandas.

data.csv:

country_name,country_code
Mexico,MX
Namibia,NA

read_data.py:

import pandas as pd
data = pd.read_csv("data.csv", keep_default_na=False)
print(data.describe())
print(data)

Output:

       country_name country_code
count             2            2
unique            2            2
top         Namibia           MX
freq              1            1

  country_name country_code
0       Mexico           MX
1      Namibia           NA
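
A related side effect worth noting (a sketch with made-up data): with keep_default_na=False and no na_values, genuinely empty fields are read as empty strings rather than NaN.

import pandas as pd
from io import StringIO

# Hypothetical data: a real 'NA' code and a missing code
csv = "country_name,country_code\nNamibia,NA\nUnknown,\n"
df = pd.read_csv(StringIO(csv), keep_default_na=False)
print(df.country_code.tolist())   # ['NA', '']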


arshovon
6

This approach works for me:

import pandas as pd
df = pd.read_csv('Test.csv')

output (with the defaults, both the missing fields and the 'NA' values become NaN):

     co1 col2 col3 col4
a      b    c    d    e
NaN  NaN  NaN  NaN  NaN
2      3    4    5  NaN

I copied the values that are interpreted as NaN by default into a list, then commented out 'NA', which I did not want interpreted as NaN. This approach still treats the other values as NaN; only 'NA' is preserved.

 na_values = ["", 
             "#N/A", 
             "#N/A N/A", 
             "#NA", 
             "-1.#IND", 
             "-1.#QNAN", 
             "-NaN", 
             "-nan", 
             "1.#IND", 
             "1.#QNAN", 
             "<NA>", 
             "N/A", 
#              "NA", 
             "NULL", 
             "NaN", 
             "n/a", 
             "nan", 
             "null"]

df1 = pd.read_csv('Test.csv', na_values=na_values, keep_default_na=False)

      co1  col2  col3  col4
a     b     c     d     e
NaN  NA   NaN    NA   NaN
2     3     4     5   NaN
Suman Shrestha
3

Building off of Anton Protopopov's answer, here is a clean way to minimally modify the default values (i.e., remove the values you don't want parsed as NaN and add those that you do):

import pandas as pd
from pandas._libs.parsers import STR_NA_VALUES

accepted_na_values = STR_NA_VALUES - {'NA'} | {'_'}
path = 'myexcel.xlsx'
df = pd.read_excel(path, keep_default_na=False, na_values=accepted_na_values)
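
Applied to the question's sample.tsv instead of an Excel file (a sketch; with keep_default_na=False, only the values in accepted_na_values are treated as missing):

import pandas as pd
from pandas._libs.parsers import STR_NA_VALUES

accepted_na_values = STR_NA_VALUES - {'NA'} | {'_'}
df = pd.read_csv('sample.tsv', sep='\t',
                 keep_default_na=False, na_values=accepted_na_values)
print(df.CHAIN.tolist())   # ['N', 'NA', 'O']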
Jon
2

While reading the file with pandas, you can pass the parameter na_filter=False or keep_default_na=False on that line:

import pandas as pd

df = pd.read_csv('sample.tsv',sep='\t',encoding='utf-8',na_filter = False)

for df_tuples in df.itertuples(index=True):
    print(df_tuples)
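
As a quick check (a sketch assuming the same sample.tsv as in the question), the CHAIN column should now keep 'NA' as a string:

print(df.CHAIN.tolist())   # expected: ['N', 'NA', 'O']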