I have a set of chemical formulae (and some other properties) saved in a csv file. One of these formulae is NaN, an unstable nitride. However, pandas identifies this as a missing value when loading from a csv file. Below is a simple reproducible example.
import pandas as pd

df = pd.DataFrame({'formula': ['BaO', 'NaN', 'NaN3']})
>>> df
formula
0 BaO
1 NaN
2 NaN3
Let's get the data type of each of these formulae.
for idx, row in df.iterrows():
    print(type(row.formula))
<class 'str'>
<class 'str'>
<class 'str'>
This is fine. Now, we save this dataframe to a csv file and reload.
df.to_csv('data.csv', index=False)
df_csv = pd.read_csv('data.csv') # same df loaded from csv
>>> df_csv
formula
0 BaO
1 NaN
2 NaN3
df_csv looks identical to df, except that when I check the data type of these formulae, I find NaN is identified as a missing numerical data point (np.nan).
for idx, row in df_csv.iterrows():
    print(type(row.formula))
<class 'str'>
<class 'float'>
<class 'str'>
This produces errors during my further processing steps. I don't want to remove the compound NaN from the database. How do I make sure NaN is not identified as a missing value, but as a string when loading data from a csv file?
I have tried df_csv['formula'] = df_csv['formula'].astype(str), but this converts NaN to nan.
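For reference, here is a minimal sketch of the behaviour I am after. It assumes the keep_default_na parameter of pd.read_csv is the right tool, which I have not confirmed is the recommended approach. It reads the same CSV content from an in-memory string (csv_text is just a stand-in for data.csv) and turns off the default list of missing-value markers, so it would also stop empty cells and strings like NA in other columns from being parsed as missing.

```python
import io

import pandas as pd

# Stand-in for the contents of data.csv from the example above
csv_text = "formula\nBaO\nNaN\nNaN3\n"

# keep_default_na=False tells pandas not to apply its default list of
# missing-value strings ("NaN", "NA", "", ...) when parsing the file
df_csv = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

for value in df_csv["formula"]:
    print(type(value), repr(value))
```

If only some columns should keep the default missing-value handling, this global switch may be too broad, so a per-column solution would still be welcome.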