(pandas version 0.16.0, numpy version 1.9.2)

I'm trying to bin values in a column and find the rows in the original data corresponding to the max values of each bin.

I found a way to accomplish this, and the approach is working on some float sample data, but not on int data:

>>> import numpy as np
>>> from pandas import *
>>> df1 = DataFrame({"id": range(3),"a": np.random.random(3)})
>>> df2 = DataFrame({"id": range(3),"a": [0,1,5]})
>>> bins = [0,1,2]
>>> grouped1 = df1.a.groupby(cut(df1.a,bins))
>>> grouped2 = df2.a.groupby(cut(df2.a,bins))
>>> idx1 = grouped1.transform(max) == df1.a
>>> df1[idx1]
           a  id
0  0.997843  0
>>> idx2 = grouped2.transform(max) == df2.a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pandas/core/groupby.py", line 2418, in transform
    return self._transform_fast(cyfunc)
  File "/usr/lib/python2.7/site-packages/pandas/core/groupby.py", line 2459, in _transform_fast
    return self._set_result_index_ordered(Series(values))
  File "/usr/lib/python2.7/site-packages/pandas/core/groupby.py", line 493, in _set_result_index_ordered
    index = Index(np.concatenate([ indices[v] for v in self.grouper.result_index ]))
KeyError: '(1, 2]'

Note that with these bins, both groupings report NaN for the empty (1, 2] bin:

>>> grouped1.max()
a
(0, 1]    0.859684
(1, 2]         NaN
Name: a, dtype: float64
>>> grouped2.max()
a
(0, 1]     1
(1, 2]   NaN
Name: a, dtype: float64

I'm having trouble understanding what the problem is. The KeyError with a bin value doesn't make much sense to me.

keyser
  • I can't run your code as 'cut' is not defined, but what version pandas and numpy are you using? – EdChum Apr 17 '15 at 12:37
  • `from pandas import *`. I'll add it. – keyser Apr 17 '15 at 12:37
  • Don't know if it is of any help, but as far as I know the concept of `NaN` is defined for floating point only, not for integers (at least in C#, for example). – heltonbiker Apr 17 '15 at 12:38
  • Well, there's also this answer from _the man_ himself: http://stackoverflow.com/a/11548224/401828 – heltonbiker Apr 17 '15 at 12:39
  • OK, I get an error just with this: `grouped2.transform(max)` I'm running python 3.4, numpy 1.9.1 and pandas 0.16.0 – EdChum Apr 17 '15 at 12:39
  • @EdChum Yes, I also noted that. transform is throwing the error. – keyser Apr 17 '15 at 12:40
  • @heltonbiker That's a good point, you might be right. Though I'd expect some ValueError or something for such a problem. – keyser Apr 17 '15 at 12:44
  • It still fails even if the dtype of column 'a' is float64, this may be a more subtle problem in the data values themselves – EdChum Apr 17 '15 at 12:45
  • I suspect a problem with categoricals: they've only been promoted relatively recently. – DSM Apr 17 '15 at 12:55
  • iirc this is fixed in master (but could be something else) – Jeff Apr 17 '15 at 13:08
  • I've opened a ticket, [#9921](https://github.com/pydata/pandas/issues/9921). This is still a bug in trunk. – DSM Apr 17 '15 at 15:25
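
A possible workaround, offered as a sketch rather than a confirmed fix for the bug above: the KeyError names the empty `(1, 2]` bin, so one option is to keep empty categories out of the grouping. The code below assumes a recent pandas release where `groupby` accepts `observed=True`; that parameter does not exist in 0.16.0, so this is not something the versions in the question can run. The name `bin_max` is made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({"id": range(3), "a": [0, 1, 5]})
bins = [0, 1, 2]

# Values outside the bin edges (0 and 5 here) get NaN as their bin label,
# because the intervals are open on the left: (0, 1] and (1, 2].
binned = pd.cut(df2["a"], bins)

# observed=True groups only on bins that actually contain rows,
# so the empty (1, 2] bin never becomes a group key.
bin_max = df2["a"].groupby(binned, observed=True).transform("max")

# Rows whose value equals the maximum of their bin; rows that fall in
# no bin compare against NaN and are therefore excluded.
idx2 = bin_max == df2["a"]
result = df2[idx2]
print(result)
```

With this sample data only the row with `a == 1` is selected, since 0 and 5 fall outside every bin.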
