(pandas version 0.16.0, numpy version 1.9.2)
I'm trying to bin the values in a column and then find the rows of the original data that hold the maximum value of each bin.
I found an approach that works on some float sample data, but it fails on int data:
>>> import numpy as np
>>> from pandas import *
>>> df1 = DataFrame({"id": range(3),"a": np.random.random(3)})
>>> df2 = DataFrame({"id": range(3),"a": [0,1,5]})
>>> bins = [0,1,2]
>>> grouped1 = df1.a.groupby(cut(df1.a,bins))
>>> grouped2 = df2.a.groupby(cut(df2.a,bins))
>>> idx1 = grouped1.transform(max) == df1.a
>>> df1[idx1]
          a  id
0  0.997843   0
>>> idx2 = grouped2.transform(max) == df2.a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pandas/core/groupby.py", line 2418, in transform
    return self._transform_fast(cyfunc)
  File "/usr/lib/python2.7/site-packages/pandas/core/groupby.py", line 2459, in _transform_fast
    return self._set_result_index_ordered(Series(values))
  File "/usr/lib/python2.7/site-packages/pandas/core/groupby.py", line 493, in _set_result_index_ordered
    index = Index(np.concatenate([ indices[v] for v in self.grouper.result_index ]))
KeyError: '(1, 2]'
Note that with these bins, both groupings produce a NaN row for the empty (1, 2] bin:
>>> grouped1.max()
a
(0, 1]    0.859684
(1, 2]         NaN
Name: a, dtype: float64
>>> grouped2.max()
a
(0, 1]      1
(1, 2]    NaN
Name: a, dtype: float64
I don't understand what is going wrong here. A KeyError whose key is a bin label doesn't make much sense to me.
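For comparison, here is a minimal sketch of another way to get the rows I'm after. It is not the transform-based approach above: it uses idxmax per bin instead, and it assumes a newer pandas release than 0.16.0, one in which groupby accepts the observed argument. The names labels, row_labels and result are just illustrative.

```python
import pandas as pd

# Same int data and bins as df2 above.
df = pd.DataFrame({"id": range(3), "a": [0, 1, 5]})
bins = [0, 1, 2]

# Label each row with its bin; values outside (0, 2] get a NaN label.
labels = pd.cut(df["a"], bins)

# observed=True (assumes a newer pandas) leaves out bins that contain no rows,
# such as (1, 2] here, so no empty group is created.
row_labels = df.groupby(labels, observed=True)["a"].idxmax()

# Look up the original rows holding each bin's maximum.
result = df.loc[row_labels.tolist()]
print(result)
```

Unlike the transform(max) comparison, idxmax keeps only the first row when several rows tie for a bin's maximum.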