It's because the groupby object handles all the bookkeeping; the grouper objects just hold a reference to their key and to the parent groupby object:
    typedef struct {
        PyObject_HEAD
        PyObject *it;          /* iterator over the input sequence */
        PyObject *keyfunc;     /* the second argument for the groupby function */
        PyObject *tgtkey;      /* the key for the current "grouper" */
        PyObject *currkey;     /* the key for the current "item" of the iterator */
        PyObject *currvalue;   /* the plain value of the current "item" */
    } groupbyobject;

    typedef struct {
        PyObject_HEAD
        PyObject *parent;      /* the groupby object */
        PyObject *tgtkey;      /* the key value for this grouper object. */
    } _grouperobject;
Since you're not iterating over the grouper objects when you unpack the groupby object, I'll ignore them for now. What's interesting is what happens in the groupby object when you call next on it:
    static PyObject *
    groupby_next(groupbyobject *gbo)
    {
        PyObject *newvalue, *newkey, *r, *grouper;

        /* skip to next iteration group */
        for (;;) {
            if (gbo->currkey == NULL)
                /* pass */;
            else if (gbo->tgtkey == NULL)
                break;
            else {
                int rcmp;
                rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);
                if (rcmp == 0)
                    break;
            }

            newvalue = PyIter_Next(gbo->it);
            if (newvalue == NULL)
                return NULL;  /* just return NULL, no invalidation of attributes */
            newkey = PyObject_CallFunctionObjArgs(gbo->keyfunc, newvalue, NULL);

            gbo->currkey = newkey;
            gbo->currvalue = newvalue;
        }
        gbo->tgtkey = gbo->currkey;

        grouper = _grouper_create(gbo, gbo->tgtkey);
        r = PyTuple_Pack(2, gbo->currkey, grouper);
        return r;
    }
I removed all the irrelevant exception handling code and removed or simplified the pure reference counting parts. The interesting thing here is that when you reach the end of the iterator, gbo->currkey, gbo->currvalue and gbo->tgtkey aren't set to NULL. They still point to the last encountered values (the last item of the iterator), because the function just returns NULL when PyIter_Next(gbo->it) == NULL.
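To make the leftover state easier to see, here is a rough pure-Python sketch of the shortened groupby_next shown above. The class name GroupBySketch and the _NULL sentinel are made up for illustration, the input list is assumed, and the sketch returns only the key instead of a (key, grouper) tuple. It models the shortened C code, not whatever your installed itertools does.

```python
_NULL = object()  # hypothetical sentinel standing in for a C NULL pointer


class GroupBySketch:
    """Illustrative model of the shortened groupby_next, not the real type."""

    def __init__(self, iterable, keyfunc):
        self.it = iter(iterable)
        self.keyfunc = keyfunc
        self.tgtkey = self.currkey = self.currvalue = _NULL

    def __iter__(self):
        return self

    def __next__(self):
        # Skip ahead until the current key differs from the target key.
        while True:
            if self.currkey is _NULL:
                pass
            elif self.tgtkey is _NULL:
                break
            elif not (self.tgtkey == self.currkey):
                break
            # StopIteration propagates from here without resetting anything,
            # mirroring the bare "return NULL" in the C code.
            newvalue = next(self.it)
            self.currkey = self.keyfunc(newvalue)
            self.currvalue = newvalue
        self.tgtkey = self.currkey
        return self.tgtkey  # the C code returns (key, grouper) instead


gb = GroupBySketch([(x > 5, x) for x in range(10)], keyfunc=lambda x: x[0])
keys = list(gb)
print(keys)          # [False, True]
print(gb.currkey)    # True
print(gb.currvalue)  # (True, 9)
```

After the loop is exhausted, the sketch still holds the last key and the last value, which is exactly the state the grouper objects later read.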
After the unpacking has finished you have your two grouper objects. The first one has a tgtkey of False and the second one a tgtkey of True. Let's have a look at what happens when you call next on these groupers:
    static PyObject *
    _grouper_next(_grouperobject *igo)
    {
        groupbyobject *gbo = (groupbyobject *)igo->parent;
        PyObject *newvalue, *newkey, *r;
        int rcmp;

        if (gbo->currvalue == NULL) {
            /* removed because irrelevant. */
        }

        rcmp = PyObject_RichCompareBool(igo->tgtkey, gbo->currkey, Py_EQ);
        if (rcmp <= 0)
            /* got any error or current group is end */
            return NULL;

        r = gbo->currvalue;  /* this accesses the last value of the groupby object */
        gbo->currvalue = NULL;
        gbo->currkey = NULL;

        return r;
    }
Remember that currvalue is not NULL, so the first if branch isn't interesting. For your first grouper, it compares the grouper's tgtkey with the currkey of the groupby object, sees that they differ, and immediately returns NULL. So you get an empty list.
For the second grouper the keys are identical, so it returns the currvalue of the groupby object (which is the last encountered value in the iterator!), but this time it also sets the currvalue and currkey of the groupby object to NULL.
Switching back to Python: the really interesting quirks happen if you have a grouper with the same tgtkey as the last group in your groupby:
    >>> import itertools
    >>> inputs = [(x > 5, x) for x in range(10)] + [(False, 10)]
    >>> (_, g1), (_, g2), (_, g3) = itertools.groupby(inputs, key=lambda x: x[0])
    >>> list(g1)
    [(False, 10)]
    >>> list(g3)
    []
That one element in g1 didn't belong to the first group at all, but because the tgtkey of the first grouper object is False and the last key of the groupby object is also False, the first grouper thought the element belonged to the first group. It also invalidated the groupby object, so the third group is now empty.
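If you need the groups after the groupby object has moved on, one way to sidestep all of this is to turn each grouper into a list while it is still the current group. A minimal sketch, reusing the inputs from the session above; it should behave the same no matter how a given CPython version treats stale groupers:

```python
import itertools

inputs = [(x > 5, x) for x in range(10)] + [(False, 10)]

# Consume each grouper before groupby advances to the next group.
groups = [(key, list(group))
          for key, group in itertools.groupby(inputs, key=lambda x: x[0])]

for key, items in groups:
    print(key, items)
```

Here the last (False, 10) element ends up in its own third group instead of leaking into the first one.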
All the code was taken from the Python source code but shortened.