Problem

I have concatenated two Series of type int, and the DataFrame I get back is of type float. This happens because the indices of the Series are not aligned, so when they are concatenated pandas fills the gaps with NaN. However, NaN is considered a float, and unfortunately it converts all my ints into floats along with it.

Question

My question is, how can I fill the gaps with something else that won't convert my ints to floats?

MCVE

import pandas as pd

s1 = pd.Series([1], index=['A'])
s2 = pd.Series([1], index=['B'])

print("s1 type: {} | s2 type: {}\n".format(s1.dtype, s2.dtype))

df = pd.concat([s1, s2], axis=1)
print(df, "\n")
print(df.dtypes)

Prints:

s1 type: int64 | s2 type: int64

     0    1
A  1.0  NaN
B  NaN  1.0 

0    float64
1    float64
dtype: object
piRSquared
  • Well this is a personal choice in terms of do you want `0` or `-1` or whatever, after that you can do `df.astype(int)` after `fillna` but as you know `NaN` cannot be represented in integer so you have to decide what you want instead – EdChum May 21 '16 at 09:06
  • So @EdChum, you know that if I were trying to answer this question and I saw that you put what amounts to a very valid answer in the comments, I'd be discouraged to put an answer that was too similar. That's my nature, and I assume that some people share that characteristic with me... Long comment made not as long, please repeat what you just wrote in the comment as an answer so I can accept it. – piRSquared May 21 '16 at 09:20
  • I don't have this hangup about this as it's unimportant so feel free to do so – EdChum May 21 '16 at 09:23
  • I'll just end up putting the answer up myself so it's clear to others, get no points, and plastering your name all over it ;-). I like points and all, but proper credit is very important to me. – piRSquared May 21 '16 at 09:28
  • So long as you mention the credit then I don't think there's a problem, personally it's not an issue and some people may do this intentionally or unintentionally I don't care it's points, not money or food – EdChum May 21 '16 at 09:36

1 Answer

Firstly, the dtype conversion happens because NaN cannot be represented in an integer dtype, so a float dtype is selected instead.
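
For instance, a single NaN in an otherwise integer Series is enough to push the whole Series to float64 (a throwaway Series here, just to illustrate the point):

s = pd.Series([1, None])  # the None becomes NaN
print(s.dtype)            # float64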

Secondly, what to do when this occurs is a personal choice that depends on your data; there is no single correct answer.

For instance, you could use fillna with an arbitrary value such as 0 or -1 and then cast the dtype back with astype(int):

In [21]:
df.fillna(0).astype(int)

Out[21]:
   0  1
A  1  0
B  0  1

But this may not be what you want. You could instead remove these rows with dropna, but that may mean losing valuable information, which could be critical if you are doing some kind of machine learning or other analysis.
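
For this particular df every row contains a NaN, so dropping rows with any missing value would leave nothing behind:

In [22]:
df.dropna()

Out[22]:
Empty DataFrame
Columns: [0, 1]
Index: []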

Alternatively, you may decide to set those missing values to the column min, max, mean, or median, but this can have serious consequences if the column values have dependencies on other columns. For instance, if you set all missing values to the min or max and there is a significant number of them, your data becomes biased towards that value and a predictive model loses entropy/information. Personally, in those situations I have found the mean to work fine.
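
As a rough sketch of that approach on the same toy df (where each column's mean is trivially 1.0), you can pass the column means to fillna; note that the result stays float64, since a mean is generally not a whole number:

In [23]:
df.fillna(df.mean())

Out[23]:
     0    1
A  1.0  1.0
B  1.0  1.0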

EdChum