Why does function behavior used within pandas apply change?

Question

I cannot figure out why a simple function:

def to_integer(value):
    if value == "":
        return None
    return int(value)

changes values from str to int only if there's no empty string "" in the dataframe, i.e. only if no value is to be returned as None.

If I run:

type(to_integer('1')) == int

it returns True.

Now, using apply and to_integer with df1:

import pandas as pd

df1 = pd.DataFrame(['1', '2', '3'], columns=['integer'])
result = df1['integer'].apply(to_integer)

gives a column of integers (np.int64).

But if I apply it to this df2:

df2 = pd.DataFrame(['1', '', '3'], columns=['integer'])
result = df2['integer'].apply(to_integer)

it returns a column of floats (np.float64).

Isn't it possible to have a dataframe with integers and None at the same time?

I use Python 3.3 and Pandas 0.12.

As far as I know, it's not possible to have an int `NaN` value in numpy; take a look here: http://stackoverflow.com/questions/12708807/numpy-integer-nan - Roman Pekar, Dec 03 '13 at 12:27

Woody Pride · Accepted Answer · 2013-12-03T12:41:55.133

You are exactly right: it is not possible to have a Series of ints that also contains np.nan values.

pandas represents a missing value as np.nan, which is a float, so an integer column that contains one is stored as np.float64. See:

http://pandas.pydata.org/pandas-docs/dev/missing_data.html
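As a minimal sketch of why this happens (the variable names here are only illustrative), np.nan is itself a Python float, so NumPy has to pick a float dtype for any array that mixes it with integers:

```python
import numpy as np

# np.nan is a plain Python float; there is no integer representation of it
nan_type = type(np.nan)

# Mixing ints with np.nan forces NumPy to choose a float dtype for the array
mixed = np.array([1, np.nan, 3])
mixed_dtype = mixed.dtype

print(nan_type, mixed_dtype)
```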

The relevant part of the documentation is as follows:

"While pandas supports storing arrays of integer and boolean type, these types are not capable of storing missing data. Until we can switch to using a native NA type in NumPy, we’ve established some “casting rules” when reindexing will cause missing data to be introduced into, say, a Series or DataFrame. Here they are:

`data type  Cast to`
`integer    float`
`boolean    object`
`float      no cast`
`object     no cast`"
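The casting rules can be checked with a short sketch like the one below. The Series names are illustrative, and reindexing with a label that does not exist is just one convenient way to introduce a missing value. The last part is an assumption about your environment: the nullable "Int64" extension dtype only exists in pandas 0.24 and later, so it is not available in the pandas 0.12 mentioned in the question.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
bools = pd.Series([True, False, True])

# Label 3 does not exist in either Series, so reindexing introduces a missing value
ints_reindexed = ints.reindex([0, 1, 2, 3])
bools_reindexed = bools.reindex([0, 1, 2, 3])

print(ints.dtype, "->", ints_reindexed.dtype)    # integer is cast to float
print(bools.dtype, "->", bools_reindexed.dtype)  # boolean is cast to object

# Assumption: pandas >= 0.24, where the nullable "Int64" dtype (capital I)
# can hold integers and missing values in the same column
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable.dtype)
```

On older versions, a common fallback is to keep the column as object dtype, at the cost of losing fast numeric operations.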
