0

I have a set of chemical formulae (and some other properties) saved in a csv file. One of these formulae is NaN, an unstable nitride. However, pandas identifies this as a missing value when loading from a csv file. Below is a simple reproducible example.

df = pd.DataFrame({'formula': ['BaO', 'NaN', 'NaN3']})

>>> df
  formula
0     BaO
1     NaN
2    NaN3

Let's get the data type of each of these formulae.

for idx, row in df.iterrows():
    print(type(row.formula))

<class 'str'>
<class 'str'>
<class 'str'>

This is fine. Now, we save this dataframe to a csv file and reload.

df.to_csv('data.csv', index=False)
df_csv = pd.read_csv('data.csv') # same df loaded from csv

>>> df_csv
  formula
0     BaO
1     NaN
2    NaN3

df_csv looks identical to df, except when I check the data type of these formulae, I find NaN is identified as a missing numerical data point (np.nan).

for idx, row in df_csv.iterrows():
    print(type(row.formula))

<class 'str'>
<class 'float'>
<class 'str'>

This produces errors during my further processing steps. I don't want to remove the compound NaN from the database. How do I make sure NaN is not identified as a missing value, but as a string when loading data from a csv file?

I have tried df_csv['formula']= df_csv['formula'].astype(str) but this converts NaN to nan.

Achintha Ihalage
  • 2,310
  • 4
  • 20
  • 33
  • 1
    Does this answer your question? [Import pandas dataframe column as string not int](https://stackoverflow.com/questions/13293810/import-pandas-dataframe-column-as-string-not-int) – JeffUK Jun 23 '21 at 12:08
  • 1
    This is will likely work - https://stackoverflow.com/questions/33952142/prevent-pandas-from-interpreting-na-as-nan-in-a-string – Lewis Taylor Jun 23 '21 at 12:10
  • @JeffUK I just tried this. The accepted answer does not work in this case. – Achintha Ihalage Jun 23 '21 at 12:11

2 Answers2

2

Set na_filter to False when using pandas.read_csv usage example

import io
import pandas as pd
csv_data = io.StringIO("formula\nBaO\nNaN\nNaN3\n")
df = pd.read_csv(csv_data, na_filter=False)
print(df['formula'].notna())

output

0    True
1    True
2    True
Name: formula, dtype: bool

For brevity sake I used io.StringIO providing file with content same as argument used in io.StringIO will give same result.

Daweo
  • 31,313
  • 3
  • 12
  • 25
2

As per pandas documentation for read_csv, 'NaN' is one of default missing value indicators.

If you're sure there are no missing values in your csv file, you could simply pass an argument na_filter = False to your read_csv() call to stop missing value parse.

Otherwise, you could use keep_default_na = False to exclude the default values and specify your own with na_values parameter.

dm2
  • 4,053
  • 3
  • 17
  • 28