
I am using a boolean condition on a pandas Series to select rows. However, I ran into the following problem:

>>> q=pandas.Series([0.5,0.5,0,1,0.5,0.5])
>>> q
0    0.5
1    0.5
2    0.0
3    1.0
4    0.5
5    0.5
dtype: float64

>>> (q-0.3).abs()
0    0.2
1    0.2
2    0.3
3    0.7
4    0.2
5    0.2
dtype: float64

>>> (q-0.7).abs()
0    0.2
1    0.2
2    0.7
3    0.3
4    0.2
5    0.2
dtype: float64

>>> (q-0.3).abs() > (q-0.7).abs()          # This is what I expected:
0     True                                 # False
1     True                                 # False
2    False                                 # False
3     True                                 # True
4     True                                 # False
5     True                                 # False
dtype: bool

>>> (q-0.3).abs() == (q-0.7).abs()
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

Apparently, "0.2" is not greater than "0.2"...

Why is the output different from what I expect?

Munichong

2 Answers


This is a floating point problem. It is described very well in this question.

To directly answer your problem, look at element 1 of your two tests: the values are not equal.

>>> (q-0.7).abs()[1]
0.19999999999999996
>>> (q-0.3).abs()[1]
0.20000000000000001
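
You can see the same thing in plain Python, without pandas. A quick sketch (float.hex exposes the underlying bits):

>>> 0.5 - 0.3
0.2
>>> 0.5 - 0.7
-0.19999999999999996
>>> # repr prints the shortest string that round-trips, which hides the
>>> # error in 0.5 - 0.3; the hex forms show the two magnitudes really
>>> # do differ, by two ulps:
>>> (0.5 - 0.3).hex()
'0x1.999999999999ap-3'
>>> abs(0.5 - 0.7).hex()
'0x1.9999999999998p-3'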

We can get your results though, with a little bit of manipulation and by utilizing the decimal module.

>>> from decimal import Decimal, getcontext
>>> import pandas
>>> s = [0.5,0.5,0,1,0.5,0.5]
>>> dec_s = [Decimal(x) for x in s]
>>> q = pandas.Series(dec_s)
>>> q
0    0.5
1    0.5
2      0
3      1
4    0.5
5    0.5
dtype: object
>>> getcontext().prec
28
>>> getcontext().prec = 2
>>> (q-Decimal(0.3)).abs() > (q-Decimal(0.7)).abs()
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

A few things to note:

  • The list of values is converted from float to decimal data types before being added to the Series.
  • The dtype is now object instead of float64, because numpy doesn't handle Decimal types natively; each element is stored as a Python object.
  • The default precision of the Decimal context is 28 significant digits (not places after the decimal point); I've chopped it to 2. This is needed because Decimal(0.3) converts the float 0.3 and therefore inherits its binary error; rounding every arithmetic result to 2 significant digits washes that error out and matches your data set.
  • The 0.3 and 0.7 values used in the comparison must also be Decimals, otherwise you will see an error similar to unsupported operand type(s) for +: 'Decimal' and 'float'. (A string-based variation that avoids the precision trick entirely is sketched below.)
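
Here is that variation (my sketch, not part of the approach above): building the Decimals from strings rather than floats makes them exact from the start, so no precision adjustment is needed.

>>> from decimal import Decimal
>>> import pandas
>>> # Strings parse exactly: Decimal('0.3') really is 3/10, so there is
>>> # no float error to round away and the default context works as-is.
>>> q = pandas.Series([Decimal(x) for x in ['0.5', '0.5', '0', '1', '0.5', '0.5']])
>>> (q - Decimal('0.3')).abs() > (q - Decimal('0.7')).abs()
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool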
Andy
  • Have you tried multiplying them by some multiple of 10? Then you could truncate them and evaluate as an integer. – Matt Aug 23 '14 at 03:15
  • +1 for the floating point issue. Not sure about the workaround, although I confess to never (needing to) care about this kind of precision; Decimal will be pretty slow as it's object... – Andy Hayden Aug 23 '14 at 05:49
  • @AndyHayden Everything is an object in Python I think, but you may be right that there are certain optimizations that may not take place. It'd be worth a benchmark if that's the bottleneck. – CornSmith Aug 23 '14 at 06:21
  • @CornSmith The point is not everything is a python object! Being float64 (or whatever) means that they are stored *not* as python objects but as contiguous C-arrays of the same type (rather than an array of pointers to python objects)... Rule of thumb is that each loop is 3 times slower, but here each *comparison* is comparing Decimal objects (in python)... so for a 6000 item Series I see this solution 1000 times slower than using float64s! Probably won't be the bottleneck, but still. – Andy Hayden Aug 24 '14 at 05:50
  • @AndyHayden Oh wow, interesting! Thanks for clearing that up for me – CornSmith Aug 25 '14 at 21:14

Andy's answer is spot on about the reason: this is a floating point issue (and also an issue of how pandas truncates floating points when printing a Series/DataFrame...).

You might like to use the numpy function isclose:

In [11]: a = (q-0.3).abs()

In [12]: b = (q-0.7).abs()

In [13]: np.isclose(a, b)
Out[13]: array([ True,  True, False, False,  True,  True], dtype=bool)

I don't think there's a native pandas function to do this, happy to be called out on that...

This has a default tolerance (atol) of 1e-8, so it may make sense for us to use that when testing greater than (to get your desired result):

In [14]: a > b + 1e-8
Out[14]:
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool
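
np.isclose returns a numpy array; if you'd rather keep a Series (index and all), the same check is easy to write by hand. A sketch that mirrors isclose with rtol set to 0:

>>> # Equality within tolerance, as a pandas Series rather than an ndarray
>>> # (this mirrors np.isclose(a, b, rtol=0, atol=1e-8)):
>>> (a - b).abs() <= 1e-8
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool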

Update: just to comment further on the performance aspect, float64 is about 1000 times faster here for a Series with 6000 elements (and this gets worse as the length increases):

In [21]: q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5] * 1000)

In [22]: %timeit a = (q-0.3).abs(); b = (q-0.7).abs(); a > b + 1e-8
1000 loops, best of 3: 726 µs per loop

In [23]: dec_s = q.apply(Decimal)

In [24]: %timeit (dec_s-Decimal(0.3)).abs() > (dec_s-Decimal(0.7)).abs()
1 loops, best of 3: 915 ms per loop

The difference is even starker with more elements:

In [31]: q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5] * 10000)

In [32]: %timeit a = (q-0.3).abs(); b = (q-0.7).abs(); a > b + 1e-8
1000 loops, best of 3: 1.5 ms per loop

In [33]: dec_s = q.apply(Decimal)

In [34]: %timeit (dec_s-Decimal(0.3)).abs() > (dec_s-Decimal(0.7)).abs()
1 loops, best of 3: 9.16 s per loop
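
Matt's suggestion in the comments (scale to integers, then compare exactly) also stays vectorized, so it keeps float64 speed. A sketch, assuming the data is known to have at most one decimal place of true precision (the scale factor is something you'd pick per data set):

>>> q = pd.Series([0.5, 0.5, 0, 1, 0.5, 0.5])   # back to the original data
>>> # Scale so the true values become integers, round away the float
>>> # noise, then compare exactly; the factor 10 assumes one decimal
>>> # place of real precision in the data.
>>> a_int = ((q - 0.3).abs() * 10).round().astype('int64')
>>> b_int = ((q - 0.7).abs() * 10).round().astype('int64')
>>> a_int > b_int
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool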
Andy Hayden