
I am new here; ideally I would have commented this on the question where I learned this usage of idxmax:

I used the same approach, and below is my code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=["A", "B", "C", "D"], index=[0, 1, 2, 3])

As soon as I use df[(df > 6)] on this df, the int values change to float:

      A     B     C     D
0   NaN   NaN   NaN   NaN
1   NaN   NaN   NaN   7.0
2   8.0   9.0  10.0  11.0
3  12.0  13.0  14.0  15.0

Why does pandas do that? Also, I read somewhere that I could use dtype=object on a Series, but are there other ways to avoid this?

Avij
    Because `np.nan` is a float: https://stackoverflow.com/questions/12708807/numpy-integer-nan – BENY Nov 07 '17 at 05:17
    @Avij - not anymore: Optional Nullable Integer Support is now officially added in pandas 0.24.0 - finally :) - please find an updated answer below. [pandas 0.24.x release notes](https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support) – mork Jan 25 '19 at 17:19

3 Answers


If you do want to keep the int look, you can cast to object and mask:

df.astype(object).mask(df<=6)
Out[114]: 
     A    B    C    D
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN    7
2    8    9   10   11
3   12   13   14   15

You can find more information here and here.

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”. One possibility is to use dtype=object arrays instead.

More information about what astype(object) does to the element types:

df.astype(object).mask(df<=6).applymap(type)
Out[115]: 
                 A                B                C                D
0  <class 'float'>  <class 'float'>  <class 'float'>  <class 'float'>
1  <class 'float'>  <class 'float'>  <class 'float'>    <class 'int'>
2    <class 'int'>    <class 'int'>    <class 'int'>    <class 'int'>
3    <class 'int'>    <class 'int'>    <class 'int'>    <class 'int'>
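
One caveat worth knowing (my addition, not part of the original answer): after astype(object), every column has dtype object, so the result loses the vectorized numeric operations that come with numeric dtypes.

df.astype(object).mask(df<=6).dtypes
Out[116]: 
A    object
B    object
C    object
D    object
dtype: object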
BENY

The limitation is mostly with NumPy:

  • NumPy's ndarray can only hold a single dtype.
  • There is no integer null value.

So we end up with a dilemma when we do df[df > 6]: Pandas will return a dataframe with values equal to df where df > 6, and null otherwise. But as I said, there is no integer null value, so we have a choice to make.

  1. Use None or np.nan for null values while making the entire ndarray of dtype==object
  2. Use np.nan as our null and make the entire array of dtype==float

Pandas chooses to make the arrays float because keeping the values numeric preserves the advantages that come with numeric dtypes and their calculations.
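
A quick illustration of the constraint (my sketch, not from the original answer): np.nan is itself a Python float, and a NumPy integer array simply cannot store it.

import numpy as np

type(np.nan)                 # <class 'float'>

a = np.array([1, 2, 3])      # dtype int64
# a[0] = np.nan              # raises ValueError: cannot convert float NaN to integer

np.array([1, np.nan]).dtype  # float64 -- the int is upcast so the array can hold the NaN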


Option 1
Use a fill value and pd.DataFrame.where

df.where(df > 6, -1)

    A   B   C   D
0  -1  -1  -1  -1
1  -1  -1  -1   7
2   8   9  10  11
3  12  13  14  15
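
A quick check (my addition) that the fill value keeps the frame integer-typed, using the df from the question:

df.where(df > 6, -1).dtypes

A    int64
B    int64
C    int64
D    int64
dtype: object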

Option 2
pd.DataFrame.stack and loc
By converting to a single dimension, we aren't forced to fill missing values in the rectangular grid with nulls.

df.stack().loc[lambda x: x > 6]

1  D     7
2  A     8
   B     9
   C    10
   D    11
3  A    12
   B    13
   C    14
   D    15
dtype: int64
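
Note (my addition): if you unstack this back into a rectangular frame, the holes reappear and so does the float conversion:

df.stack().loc[lambda x: x > 6].unstack()

      A     B     C     D
1   NaN   NaN   NaN   7.0
2   8.0   9.0  10.0  11.0
3  12.0  13.0  14.0  15.0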
piRSquared

In previous versions (< 0.24.0) pandas indeed converted any int column to float if even a single NaN was present. But not anymore: optional nullable integer support was officially added in pandas 0.24.0.

From the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."
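
A minimal sketch against a recent pandas (the <NA> marker appears from pandas 1.0; 0.24.x still printed NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=["A", "B", "C", "D"]).astype("Int64")
df[df > 6]

      A     B     C     D
0  <NA>  <NA>  <NA>  <NA>
1  <NA>  <NA>  <NA>     7
2     8     9    10    11
3    12    13    14    15

df[df > 6].dtypes  # every column stays Int64 -- no float conversion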

mork