
I am struggling to filter my dataframe so that I keep only rows with whole numbers (like 21.00). I saw one similar Q&A (Creating new column in pandas df populated by True,False depending on whether another column is a whole number), but it is not what I want to achieve. I tried `float.is_integer()`, but that is not a method of Series and would have to be applied element-wise with a for loop.

In my dataframe I have a column like this:

index  value
0      43.00 
1      23.47
2       5.31 
3      349.00

and I want to extract only the rows that contain whole numbers, so in the case above I want only the rows with the values 43.00 and 349.00.

How can this be done without a for loop or an extra indicator column? My dataframe has tens of millions of rows, so performance matters.

1 Answer


You can use a Boolean series to filter a dataframe:

res = df[df['value'].map(lambda x: x.is_integer())]

print(res)

   index  value
0      0   43.0
3      3  349.0

For performance, you may wish to compare a series against an integer version of itself:

res = df[df['value'] == df['value'].astype(int)]
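Putting the above together, here is a minimal self-contained sketch on the sample data from the question (assuming the column is named `value`, as shown there):

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({'value': [43.00, 23.47, 5.31, 349.00]})

# Compare each float against its integer-cast counterpart;
# only whole numbers compare equal (43.0 == 43, but 23.47 != 23)
res = df[df['value'] == df['value'].astype(int)]

print(res['value'].tolist())  # [43.0, 349.0]
```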

Performance benchmarking

The cost is dominated by construction of the Boolean series.

df2 = pd.concat([df]*100000)

%timeit df2['value'].values % 1 == 0.0              # 20.8 ms per loop
%timeit df2['value'] == df2['value'].astype(int)    # 2.59 ms per loop
%timeit df2['value'].map(lambda x: x.is_integer())  # 195 ms per loop
%timeit ~(df2['value'] % 1).astype(bool)            # 23.3 ms per loop
%timeit df2['value'] % 1 == 0.0                     # 21.8 ms per loop
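One caveat worth noting with the fastest option: `astype(int)` cannot handle `NaN` values (depending on the pandas version it raises or produces garbage), while the modulo-based tests simply evaluate `NaN` rows as `False`. A small sketch of the safer variant, assuming a column that may contain missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [43.0, np.nan, 349.0]})

# NaN % 1 is NaN, and NaN == 0 is False, so NaN rows are dropped
mask = df['value'] % 1 == 0
res = df[mask]

print(res['value'].tolist())  # [43.0, 349.0]
```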

Versions:

sys.version     # '3.6.0'
pd.__version__  # '0.19.2'
np.__version__  # '1.11.3'
  • Actually, I think `df2['value'].values % 1 == 0.0` may be faster. – Scott Boston Jun 20 '18 at 17:58
  • @ScottBoston, Hmm, unfortunately I don't see that vs comparing to `pd.Series.astype(int)`. I've updated with the additional timings. – jpp Jun 20 '18 at 18:02
  • Pandas objects carry overhead; using the NumPy array is almost always faster, especially for comparisons like this. Pandas just handles NaN and other things more gracefully, but that grace has a cost. – Scott Boston Jun 20 '18 at 18:06
  • @ScottBoston, Ah yes, I agree it's marginally (~5%) faster than the Pandas version. But it's still ~9x slower than `df2['value'] == df2['value'].astype(int)`. – jpp Jun 20 '18 at 18:07
  • Ah, yes, I see: `.astype(int)` is optimal. – Scott Boston Jun 20 '18 at 18:10