I am working with a df and using numpy to transform data, including setting blanks (or '') to NaN. But when I write the df to CSV, the output contains the string 'nan' as opposed to being NULL.

I have looked around but can't find a workable solution. Here's the basic issue:

df
index x    y   z
0     1   NaN  2
1     NaN  3   4

CSV output:

index x    y   z
0     1   nan  2
1     nan  3   4

I have tried a few things to set 'nan' to NULL but the csv output results in a 'blank' rather than NULL:

dfDemographics = dfDemographics.replace('nan', np.NaN)
dfDemographics.replace(r'\s+( +\.)|#', np.nan, regex=True).replace('', np.nan)
dfDemographics = dfDemographics.replace('nan', '')  # of course, this wouldn't work, but tried it anyway.

Any help would be appreciated.

cs95
Jerry

5 Answers

Pandas to the rescue: use `na_rep` to set your own representation for NaNs.

df.to_csv('file.csv', na_rep='NULL')

file.csv

,index,x,y,z
0,0,1.0,NULL,2
1,1,NULL,3.0,4
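
If the goal is an empty field rather than the literal string NULL, the same parameter accepts an empty string (a minimal sketch with a made-up df, writing to an in-memory buffer for illustration):

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, np.nan], "y": [np.nan, 3], "z": [2, 4]})

buf = io.StringIO()
df.to_csv(buf, na_rep='')  # NaN cells become empty fields (also the default)
print(buf.getvalue())
```

Passing a real path (`df.to_csv('file.csv', na_rep='')`) works the same way.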
cs95
  • Not exactly what I need - this replaces NA in df with a string of 'NULL'. I realize my post may not be clear - by NULL I meant - nothing. That is the NaN, in the csv output, should be just nothing - no strings, no data, including spaces. – Jerry Jun 16 '18 at 19:42
  • 5
    @Jerry What about `df.to_csv('file.csv', na_rep='')`? – cs95 Jun 16 '18 at 19:51
  • 1
    @Jerry That's the default behaviour of `to_csv` actually, you are probably doing something else wrong but I don't see how it is possible to get the output from the input you posted. https://imgur.com/3xNP5aG – ayhan Jun 16 '18 at 19:55
  • @user2285236 - I believe you are correct - this shouldn't happen; something else is going on in the script, most likely with my use of numpy (np.select). Will keep at it. – Jerry Jun 16 '18 at 20:30
  • @Jerry "something else is probably going on" and what might that be? Your question as it stands is unanswerable. Should we close it? – cs95 Jun 16 '18 at 20:32
  • @CallCentreExecutive - I have investigated without success to resolve this issue. The question remains unanswered. Per your note, you are welcome to close the question. – Jerry Jun 25 '18 at 15:08
  • In my case, `quoting=csv.QUOTE_NONNUMERIC` was causing the NULL to be quoted. – Conor Jul 29 '20 at 13:19

Using `df.replace` may help:

df = df.replace(np.nan, '', regex=True)
df.to_csv("df.csv", index=False)

(This sets all the null values to '', i.e. the empty string.)
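
One caveat worth noting (my addition, not part of the original answer): once '' is mixed into a numeric column, pandas coerces that column to object dtype, which matters if you keep computing on the frame afterwards. A quick check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan]})
cleaned = df.replace(np.nan, '', regex=True)

print(df["x"].dtype)       # float64
print(cleaned["x"].dtype)  # object -- floats and '' now share one column
```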

Kranthi Kiran
  • Thanks - but no luck here either. Something else is probably going on. – Jerry Jun 16 '18 at 20:31
  • Can you please elaborate on what you want in the final .csv file for NaN values? Do you want them to be null, or what? I'm not able to understand. @Jerry – Kranthi Kiran Jun 16 '18 at 20:35
  • the NaN in csv output should be just that - nothing. For some reason, some of the columns in the df with NaN are being written as either 'nan' (as a string) or just '' (empty string). The problem is most likely arising during the data transformation process, probably with my use of np.select. – Jerry Jun 16 '18 at 21:02

In my situation, the culprit was `np.where`. When the data types of the two return elements are different, your `np.NaN` will be converted to the string `'nan'`.

It's hard (for me) to see exactly what's going on under the hood, but I suspect this might be true for other Numpy array methods that have mixed types.

A minimal example:

import numpy as np
import pandas as pd

seq = [1, 2, 3, 4, np.NaN]
same_type_seq = np.where("parrot"=="dead", 0, seq)
diff_type_seq = np.where("parrot"=="dead", "spam", seq)

pd.Series(seq).to_csv("vanilla_nan.csv", header=False) # as expected, last row is blank
pd.Series(same_type_seq).to_csv("samey_nan.csv", header=False) # also, blank
pd.Series(diff_type_seq).to_csv("nany_nan.csv", header=False) # nan instead of blank
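
The coercion is visible directly in the result dtypes (a quick self-contained check of the same idea, relying on NumPy's usual type promotion):

```python
import numpy as np

seq = [1, 2, 3, 4, np.nan]

# same types: result stays float64, NaN survives
print(np.where(False, 0, seq).dtype)       # float64

# mixed types: everything is promoted to a Unicode string dtype,
# so NaN becomes the literal string 'nan'
print(np.where(False, "spam", seq).dtype)
```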

So how to get round this? I'm not too sure, but as a hacky workaround for small datasets, you can replace NaN in your original sequence with a token string and then replace it back with np.NaN afterwards:

repl = "missing"
hacky_seq = np.where("parrot"=="dead", "spam", [repl if np.isnan(x) else x for x in seq])
pd.Series(hacky_seq).replace({repl:np.NaN}).to_csv("hacky_nan.csv", header=False)
gherka
  • Everything that is not a string is coerced to a string to avoid having mixed types in the array. – cs95 Feb 01 '20 at 09:51

Based on gherka's and Kranthi Kiran's answers:

df = df.replace("nan", '', regex=True)
df.to_csv("df.csv", index=False)
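
One caution with this variant (my note, not the answerer's): with `regex=True` the pattern `'nan'` matches substrings, so legitimate values that merely contain "nan" get mangled too. A demonstration:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["banana", "nan"]})
out = df.replace("nan", '', regex=True)
print(out["fruit"].tolist())  # ['baa', ''] -- 'banana' loses its inner 'nan'
```

Anchoring the pattern (`df.replace("^nan$", '', regex=True)`) restricts the replacement to cells that are exactly "nan".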
glisu

User @coldspeed illustrates how to replace NaN values with NULL when saving a pd.DataFrame. If, for data analysis, one is instead interested in replacing the null values in a pd.DataFrame with np.NaN values, the following code will do:

import numpy as np, pandas as pd

# replace null values with np.nan
colNames = mydf.columns.tolist()
dfVals = mydf.values
matSyb = mydf.isnull().values  # boolean mask marking the null positions
dfVals[matSyb] = np.NAN

mydf = pd.DataFrame(dfVals, columns=colNames)
#np.nansum(mydf.values, axis=0 )
#np.nansum(dfVals, axis=0 )
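
If the NULLs are literal strings (e.g. read in from a database dump), a shorter route, sketched here as an alternative rather than as part of the answer above, is pandas' own replace:

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({"x": ["1", "NULL"], "y": ["NULL", "3"]})
mydf = mydf.replace("NULL", np.nan)
print(mydf.isna().sum().sum())  # 2 -- both "NULL" strings are now NaN
```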
Good Will