
When saving a Pandas DataFrame to CSV, some integers get converted into floats. This happens when a column of integers contains missing values (np.nan).

Is there a simple way to avoid it? (Especially in an automatic way - I often deal with many columns of various data types.)

For example

import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2],[3,np.nan],[5,6]],
                  columns=["a","b"],
                  index=["i_1","i_2","i_3"])
df.to_csv("file.csv")

yields

,a,b
i_1,1,2.0
i_2,3,
i_3,5,6.0

What I would like to get is

,a,b
i_1,1,2
i_2,3,
i_3,5,6

EDIT: I am fully aware of Support for integer NA (Pandas Caveats and Gotchas). The question is what a nice workaround is (especially when there are many other columns of various types and I do not know in advance which "integer" columns have missing values).

Piotr Migdal
  • Why is this an issue? There is no way to represent `NaN` for ints, hence the conversion to floats. You'd have to substitute the `NaN` values with something that can be represented as an int, like `0`, or convert to string, replace the `nan` string with an empty value, and then export (a quick sketch follows after these comments). – EdChum Sep 11 '14 at 14:00
  • @EdChum I know that `NaN`s are floats. It is just annoying that there is no "missing int" (from a data perspective, a missing field is a missing field; there is nothing special about missing floats). The thing is I do not want to export a missing int as `0`, but as an empty field (for some applications I do convert missing ints to `-1`, but for others it might be problematic). – Piotr Migdal Sep 11 '14 at 14:37
  • @PiotrMigdal I think your only shot here in that case is to convert to strings and fill nan with empty strings, as already suggested – Korem Sep 11 '14 at 16:04
  • I think I understand what you are looking for so I've attempted an answer. However, you might want to consider expanding the sample data to better fit your situation. I understand the limitation of the existing answer from Korem/EdChum but it actually does produce the results you asked for b/c your sample data is very simple. – JohnE Jun 29 '15 at 14:12
  • It seems that perhaps the categorical type might be part of a possible solution, since (in theory) you could have an integer mapping that includes NaNs. However, if I try that conversion it ends up with a float index which negates any practical improvement in this situation, but it seems like there is some potential there. – JohnE Jun 29 '15 at 16:52
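
As a rough sketch of the string-conversion workaround suggested in the comments (purely illustrative; it reuses the DataFrame from the question and assumes you already know which column is affected):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, np.nan], [5, 6]],
                  columns=["a", "b"],
                  index=["i_1", "i_2", "i_3"])

# Render the affected column as strings: whole numbers without a decimal
# point, missing values as empty fields.
out = df.copy()
out["b"] = out["b"].map(lambda x: "" if pd.isna(x) else str(int(x)))
out.to_csv("file.csv")

This produces the output asked for above, but it has to be applied column by column; the answers below automate the same idea.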

4 Answers


Using float_format = '%.12g' inside the to_csv function solved a similar problem for me. It keeps up to 12 significant digits for legitimate floats, but drops the decimals for ints that were forced to floats by the presence of NaNs:

In [4]: df
Out[4]: 
     a    b
i_1  1    2.0
i_2  3    NaN
i_3  5.9  6.0

In [5]: df.to_csv('file.csv', float_format = '%.12g')

Output is:

,a,b
i_1,1,2
i_2,3,
i_3,5.9,6
gnelson

This snippet does what you want and should be relatively efficient at doing it.

import numpy as np
import pandas as pd

EPSILON = 1e-9

def _lost_precision(s):
    """
    The total amount of precision lost over Series `s`
    during conversion to int64 dtype
    """
    try:
        return (s - s.fillna(0).astype(np.int64)).abs().sum()
    except ValueError:
        return np.nan

def _nansafe_integer_convert(s):
    """
    Convert Series `s` to an object type with `np.nan`
    represented as an empty string ""
    """
    if _lost_precision(s) < EPSILON:
        # Here's where the magic happens
        as_object = s.fillna(0).astype(np.int64).astype(object)
        as_object[s.isnull()] = ""
        return as_object
    else:
        return s


def nansafe_to_csv(df, *args, **kwargs):
    """
    Write `df` to a csv file, allowing for missing values
    in integer columns

    Uses `_lost_precision` to test whether a column can be
    converted to an integer data type without losing precision.
    Missing values in integer columns are represented as empty
    fields in the resulting csv.
    """
    df.apply(_nansafe_integer_convert).to_csv(*args, **kwargs)

We can test this with a simple DataFrame which should cover all bases:

In [75]: df = pd.DataFrame([[1,2, 3.1, "i"],[3,np.nan, 4.0, "j"],[5,6, 7.1, "k"]],
                  columns=["a","b", "c", "d"],
                  index=["i_1","i_2","i_3"])
In [76]: df
Out[76]: 
     a   b    c  d
i_1  1   2  3.1  i
i_2  3 NaN  4.0  j
i_3  5   6  7.1  k

In [77]: nansafe_to_csv(df, 'deleteme.csv', index=False)

Which produces the following csv file:

a,b,c,d
1,2,3.1,i
3,,4.0,j
5,6,7.1,k
LondonRob

I'm expanding the sample data here to hopefully make sure this is handling the situations you are dealing with:

df = pd.DataFrame([[1.1,2,9.9,44,1.0],
                   [3.3,np.nan,4.4,22,3.0],
                   [5.5,8,np.nan,66,4.0]],
                  columns=list('abcde'),
                  index=["i_1","i_2","i_3"])

       a   b    c   d  e
i_1  1.1   2  9.9  44  1
i_2  3.3 NaN  4.4  22  3
i_3  5.5   8  NaN  66  4

df.dtypes

a    float64
b    float64
c    float64
d      int64
e    float64

I think if you want a general solution, it will have to be coded explicitly, since pandas does not allow NaNs in int columns. What I do below is check for integer values (we can't really check the dtype, because columns containing NaNs will already have been recast to float); if a column holds only integer values, I convert it to a string format and replace 'NAN' with '' (empty). Of course, this is not how you want to store the integers except as a final step before writing the output.

for col in df.columns:
    if any(df[col].isnull()):
        tmp = df[col][df[col].notnull()]
        # only reformat columns whose non-missing values are all whole numbers
        if all(tmp.astype(int).astype(float) == tmp.astype(float)):
            # '{:.0F}' renders np.nan as 'NAN'; replace it with an empty field
            df[col] = df[col].map('{:.0F}'.format).replace('NAN', '')

df.to_csv('x.csv')

Here's the output file, and also what it looks like if you read it back into pandas (although the purpose is presumably to read it into other numerical packages).

%more x.csv

,a,b,c,d,e
i_1,1.1,2,9.9,44,1.0
i_2,3.3,,4.4,22,3.0
i_3,5.5,8,,66,4.0

pd.read_csv('x.csv')

  Unnamed: 0    a   b    c   d  e
0        i_1  1.1   2  9.9  44  1
1        i_2  3.3 NaN  4.4  22  3
2        i_3  5.5   8  NaN  66  4
JohnE
  • Thanks! Makes sense; still, it coerces floats that happen to have integer values to ints (consider a column with values `[1.0, -5.0, 3.0]`). What I see, though, is that adding a single `np.nan` changes the type of all entries, so there is no way to recover the original. :/ In this case I am curious whether it is possible to avoid column casting (e.g. having `object` dtype and mixed element types). EDIT: It seems that setting `dtype='object'` when creating a `DataFrame`, or `low_memory=False`, does the trick (a sketch follows below). – Piotr Migdal Jun 29 '15 at 17:05
  • @PiotrMigdal Just edited, take a look. First part of question should be fixed with addition of `if any( df[col].isnull() ):` (also added a new column). I don't understand the second part of question. Only objects can be mixed, storing numbers as objects should always be last resort though (for case like this, I guess) because numerical performance will be far worse with objects than ints/floats. – JohnE Jun 29 '15 at 17:43
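
For completeness, a minimal sketch of the `dtype='object'` idea from the comment above (illustrative only; as the reply notes, storing numbers as objects costs numerical performance, so it is best kept to the export step):

import numpy as np
import pandas as pd

# With dtype='object' the integers stay Python ints even though np.nan
# appears in the same column, so to_csv writes them without a trailing '.0'
# and the missing value becomes an empty field.
df = pd.DataFrame([[1, 2], [3, np.nan], [5, 6]],
                  columns=["a", "b"],
                  index=["i_1", "i_2", "i_3"],
                  dtype='object')
df.to_csv("file.csv")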

@EdChum's suggestion in the comments is nice; you could also use the float_format argument (see the docs):

In [28]: a
Out[28]: 
   a   b
0  0   1
1  1 NaN
2  2   3
In [31]: a.to_csv(r'c:\x.csv', float_format = '%.0f')

Gives out:

,a,b
0,0,1
1,1,
2,2,3
Korem
  • In general I have many columns. I don't want to format "normal floats" as `'%.0f'`. I only want to format `int` (mixed with `np.nan`s, which are sadly floats) as `'%.0f'`. – Piotr Migdal Sep 11 '14 at 14:34