2

I'm trying to understand a quirk that I discovered in an answer that I wrote up earlier today. Basically, I was yielding groups from a generator function that wrapped itertools.groupby. The interesting thing that I discovered is that if I unpack the generator on the left hand side of an assignment, the last element from the generator remains. For example:

# test_gb.py
from itertools import groupby
from operator import itemgetter

inputs = ((x > 5, x) for x in range(10))

def make_groups(inputs):
    for _, group in groupby(inputs, key=itemgetter(0)):
        yield group

a, b = make_groups(inputs)
print(list(a))
print(list(b))

On Cpython, this results in:

$ python3 ~/sandbox/test_gb.py 
[]
[(True, 9)]

this is the case on both CPython2.7 and CPython3.5.

On PyPy, it results in:

$ pypy ~/sandbox/test_gb.py 
[]
[]

In both cases, the first empty list ("a") is pretty easy to explain -- The group from itertools gets consumed as soon as the next element is required. Since we didn't save those values anywhere, they're lost to the ether.

It seems to me that the PyPy version makes sense for the second empty list ("b") too ... When unpacking, we consume b too (because python needs to look for what's after to make sure that it shouldn't throw a ValueError for the wrong number of items to unpack). For some reason though, the CPython version retains the last element from the input iterable ... Can anyone explain why this might be?

Edit

This is probably more or less obvious, but we can also write it as:

inputs = ((x > 5, x) for x in range(10))
(_, a), (_, b) = groupby(inputs, key=itemgetter(0))
print(list(a))
print(list(b))

and get the same results ...

Community
  • 1
  • 1
mgilson
  • 300,191
  • 65
  • 633
  • 696
  • 1
    We have changed PyPy to behave like CPython. I know this is not really answering the question "why": just copying the exact algorithm from CPython into PyPy, instead of using a version reconstructed from the docs, ensures we get the same results. – Armin Rigo May 11 '17 at 10:18
  • 1
    The behavior seems a bit broken. I think you may want to ask this question in a CPython issue report. Maybe it will be fixed for the next 3.7 release; or maybe someone will find a reason for why the current behavior is better; or maybe your issue will remain open forever. – Armin Rigo May 11 '17 at 10:22
  • 1
    For anyone else interested, It looks like the pypy change was made in [this commit](https://bitbucket.org/pypy/pypy/commits/6093ff1a44e6b17f09db83aa80aea562a738c286) – mgilson May 11 '17 at 13:26
  • 1
    And I've reported it as a bug to the CPython folks. You can track it at http://bugs.python.org/issue30346 – mgilson May 11 '17 at 19:16

1 Answers1

2

It's because the groupby object handles the bookkeeping and the grouper objects just reference their key and the parent groupby object:

typedef struct {
    PyObject_HEAD
    PyObject *it;          /* iterator over the input sequence */
    PyObject *keyfunc;     /* the second argument for the groupby function */
    PyObject *tgtkey;      /* the key for the current "grouper" */
    PyObject *currkey;     /* the key for the current "item" of the iterator*/
    PyObject *currvalue;   /* the plain value of the current "item" */
} groupbyobject;

typedef struct {
    PyObject_HEAD
    PyObject *parent;      /* the groupby object */
    PyObject *tgtkey;      /* the key value for this grouper object. */
} _grouperobject;

Since you're not iterating the grouper object when you unpack the groupby object I'll ignore them for now. So what's interesting is what happens in the groupby when you call next on it:

static PyObject *
groupby_next(groupbyobject *gbo)
{
    PyObject *newvalue, *newkey, *r, *grouper;

    /* skip to next iteration group */
    for (;;) {
        if (gbo->currkey == NULL)
            /* pass */;
        else if (gbo->tgtkey == NULL)
            break;
        else {
            int rcmp;

            rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);
            if (rcmp == 0)
                break;
        }

        newvalue = PyIter_Next(gbo->it);
        if (newvalue == NULL)
            return NULL;   /* just return NULL, no invalidation of attributes */
        newkey = PyObject_CallFunctionObjArgs(gbo->keyfunc, newvalue, NULL);

        gbo->currkey = newkey;
        gbo->currvalue = newvalue;
    }
    gbo->tgtkey = gbo->currkey;

    grouper = _grouper_create(gbo, gbo->tgtkey);
    r = PyTuple_Pack(2, gbo->currkey, grouper);
    return r;
}

I removed all the irrelevant exception handling code and removed or simplified pure reference counting stuff. The interesting thing here is that when you reach the end of the iterator the gbo->currkey, gbo->currvalue and gbo->tgtkey aren't set to NULL, they will still point to the last encountered values (the last item of the iterator) because it just return NULL when PyIter_Next(gbo->it) == NULL.

After this finished you have your two grouper objects. The first one will have a tgtvalue of False and the second with True. Let's have a look what happens when you call next on these groupers:

static PyObject *
_grouper_next(_grouperobject *igo)
{
    groupbyobject *gbo = (groupbyobject *)igo->parent;
    PyObject *newvalue, *newkey, *r;
    int rcmp;

    if (gbo->currvalue == NULL) {
        /* removed because irrelevant. */
    }

    rcmp = PyObject_RichCompareBool(igo->tgtkey, gbo->currkey, Py_EQ);
    if (rcmp <= 0)
        /* got any error or current group is end */
        return NULL;

    r = gbo->currvalue;  /* this accesses the last value of the groupby object */
    gbo->currvalue = NULL;
    gbo->currkey = NULL;

    return r;
}

So remember currvalue is not NULL, so the first if branch isn't interesting. For your first grouper it compares the tgtkey of the grouper and the groupby object and sees that they differ and it will immediatly return NULL. So you got an empty list.

For the second iterator the tgtkeys are identical, so it will return the currvalue of the groupby object (which is the last encountered value in the iterator!), but this time it will set the currvalue and currkey of the groupby object to NULL.


Switching back to python: The really interesting quirks happen if you have a grouper with the same tgtkey as the last group in your groupby:

import itertools

>>> inputs = [(x > 5, x) for x in range(10)] + [(False, 10)]
>>> (_, g1), (_, g2), (_, g3) = itertools.groupby(inputs, key=lambda x: x[0])
>>> list(g1)
[(False, 10)]
>>> list(g3)
[]

That one element in g1 didn't belong to the first group at all - but because the tgtkey of the first grouper object is False and the last tgtkey is False the first grouper thought it belongs into the first group. It also invalidated the groupby object so the third group is now empty.


All the code was taken from the Python source code but shortened.

MSeifert
  • 145,886
  • 38
  • 333
  • 352