Consider the following three DataFrames:

import pandas as pd

df1 = pd.DataFrame([[1,2],[4,3]])
df2 = pd.DataFrame([[1,.2],[4,3]])
df3 = pd.DataFrame([[1,'a'],[4,3]])

Here are the types of the values in the second column of each DataFrame:

In [56]: map(type,df1[1])
Out[56]: [numpy.int64, numpy.int64]

In [57]: map(type,df2[1])
Out[57]: [numpy.float64, numpy.float64]

In [58]: map(type,df3[1])
Out[58]: [str, int]

In the first case, all ints are cast to numpy.int64. Fine. In the third case, there is basically no casting. However, in the second case, the integer (3) is cast to numpy.float64, probably because the other number in the column is a float.

How can I control the casting? In the second case, I want to have either [float64, int64] or [float, int] as types.

Workaround:

A callable formatting function can serve as a workaround, as shown here:

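import numpy as np
import pandas as pd

def printFloat(x):
    # np.modf returns (fractional part, integer part); a zero fraction means x is whole
    if np.modf(x)[0] == 0:
        return str(int(x))
    else:
        return str(x)

# Use printFloat whenever pandas displays a float
pd.options.display.float_format = printFloat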
Dror
  • Nice notebook! I think that is a very reasonable solution and good use of the `float_format`. – joris Dec 09 '14 at 15:28
  • Thanks! Can you suggest any improvement(s) to `printFloat`? – Dror Dec 09 '14 at 21:54
  • Maybe just using `x % 1` instead of `np.modf` also works, and is faster, although I don't think that speed will be an issue (only a limited number of items is ever printed). – joris Dec 09 '14 at 22:39
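The `x % 1` suggestion above can be sketched as follows. This is only an illustration: `print_float` is a hypothetical name, the sample DataFrame is `df2` from the question, and the use of `pd.option_context` (so the setting is applied temporarily rather than globally) is an assumption on top of the original workaround.

```python
import pandas as pd

def print_float(x):
    # Hypothetical variant of printFloat: x % 1 is 0 for whole numbers.
    if x % 1 == 0:
        return str(int(x))
    return str(x)

df = pd.DataFrame([[1, 0.2], [4, 3]])

# option_context restores the previous display setting when the block exits,
# so the custom formatter does not leak into the rest of the session.
with pd.option_context("display.float_format", print_float):
    rendered = df.to_string()

print(rendered)
```

The column keeps its float64 dtype; only the displayed text changes.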

1 Answer

Each column of a pandas DataFrame (or a Series) has a single, homogeneous type. You can inspect it with dtype (or DataFrame.dtypes):

In [14]: df1[1].dtype
Out[14]: dtype('int64')

In [15]: df2[1].dtype
Out[15]: dtype('float64')

In [16]: df3[1].dtype
Out[16]: dtype('O')
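As a small sketch of where these dtypes come from (the variable names `mixed` and `fallback` are only for illustration): mixing an int with a float promotes the whole column to float64, while a value that cannot be promoted to a number makes the column fall back to object.

```python
import numpy as np
import pandas as pd

# NumPy's promotion rule for the two numeric dtypes involved in df2's second column.
promoted = np.result_type(np.int64, np.float64)
print(promoted)

# A column built from a float and an int gets one common numeric dtype.
mixed = pd.Series([0.2, 3])
print(mixed.dtype)

# A string cannot be promoted to a number, so the column becomes object.
fallback = pd.Series(["a", 3])
print(fallback.dtype)
```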

Only the generic 'object' dtype can hold any Python object, and so it can also contain mixed types:

In [18]: df2 = pd.DataFrame([[1,.2],[4,3]], dtype='object')

In [19]: df2[1].dtype
Out[19]: dtype('O')

In [20]: map(type,df2[1])
Out[20]: [float, int]

But this is really not recommended, as this defeats the purpose (or at least the performance) of pandas.
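A minimal sketch of moving between the two representations, assuming the column holds only numbers (the names `col` and `numeric` are illustrative): the object column keeps the original Python types, and `pd.to_numeric` converts it back to a regular numeric dtype when you want to compute on it.

```python
import pandas as pd

# An object column stores the Python objects themselves, so types stay mixed.
col = pd.Series([0.2, 3], dtype="object")
types = [type(v) for v in col]
print(types)

# Converting back gives a homogeneous numeric column again.
numeric = pd.to_numeric(col)
print(numeric.dtype)
```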

Is there a reason you specifically want both ints and floats in the same column?

joris
  • Well, trivial reason. Some of the rows can be represented by `int`'s and some only by `float`'s. Can a transposed version of the table serve as a solution? – Dror Dec 08 '14 at 16:28
  • Possibly, but then the ints and floats that now sit on different rows would have to end up in the same column. But still, why not then represent all data as floats? (memory issue?) – joris Dec 08 '14 at 16:31
  • I was taught that if you can represent something as an `int`, then don't use `float`. So memory is one thing, beauty of the code is the second, and printing the data is the third. If `int`s are represented as `float`s, then there are annoying trailing `.00`s when they are `print`ed. – Dror Dec 08 '14 at 16:38
  • "I was taught that if you can represent something as an int, then don't use float" -> That is certainly true in general, but not anymore in numpy (scientific python) land when you want to put that data in the same array (or Series in this case) and do performant analysis on it. And if you are concerned about memory, it may be better to investigate if you need int64/float64, as maybe int32/float32 can be enough. – joris Dec 08 '14 at 16:42
  • And what about the `print`'ing? Is there a pythonic way to prettify it? – Dror Dec 08 '14 at 16:47
  • Take a look at [numpy's print options](http://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html) or a related post [here](http://stackoverflow.com/a/2891805/764322) – Imanol Luengo Dec 08 '14 at 19:00
  • But automatically applying different print format rules within one column will be difficult. Is it just for printing in an interactive session? Or for another application? – joris Dec 08 '14 at 19:07
  • A single column in my report holds many integer numbers and only a few floats. This behavior is rather annoying, both for space efficiency and for readability of the (interactive) output. – Dror Dec 09 '14 at 12:36
  • Note that the default dtypes used in pandas for int and float (int64, float64) take the same amount of memory. And if the output formatting is important, I think you should question if putting them in the same column, or using a DataFrame for this, is the best thing to do. – joris Dec 09 '14 at 12:52
  • What could be an alternative to `DataFrame`? It seems to be a very convenient way to "store" data in tables inside Python – Dror Dec 09 '14 at 12:55
  • Difficult to say without knowing the exact application. It will depend on which features of pandas you use, how big your dataset is, etc. Sometimes using dicts/lists is less overhead. To be honest, in most cases I would use pandas, but then you have to live with the one-column, one-type formatting issue (or write a custom print function that formats each value separately depending on its value). – joris Dec 09 '14 at 12:58
  • Can you please give me a lead, so I could read further on *custom print functions*? – Dror Dec 09 '14 at 13:05
  • Aha, I think you have figured it out yourself, seems like a good solution! – joris Dec 09 '14 at 15:31
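The memory point from the comments can be checked with a short sketch. The sample values and variable names are illustrative, and whether float32 is precise enough depends on your data.

```python
import pandas as pd

as_float64 = pd.Series([0.2, 3, 4, 5], dtype="float64")
as_int64 = pd.Series([1, 3, 4, 5], dtype="int64")

# Both default dtypes use 8 bytes per value, so storing whole numbers
# as float64 costs no extra memory compared with int64.
print(as_float64.dtype.itemsize, as_int64.dtype.itemsize)

# A narrower dtype halves the per-value size, at the cost of precision.
as_float32 = as_float64.astype("float32")
print(as_float32.dtype.itemsize)
```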