The Situation

I'm classifying the rows in a DataFrame using a certain classifier based on the values in a particular column. My goal is to append the results to one new column or another depending on certain conditions. The code, as it stands, looks something like this:

df = pd.DataFrame({'A': [list with classifier ids],  # Only 3 ids, one-word strings
                   'B': [List of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                   'C': [List of the old classes]})  # Hundreds of possible classes, four-digit integers stored as strings

df.sort_values('A', inplace=True)

new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
    classifier = classy_dict[name]
    vectors = vectorize(group.B.values)

    preds = classifier.predict(vectors)
    scores = classifier.decision_function(vectors)

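    for pred, score, old_class in zip(preds, scores, group.C.values):
        if old_class == pred:
            # Prediction agrees with the old class: keep it in column E.
            new_col1.append(np.nan)
            new_col2.append(old_class)
        else:
            # Otherwise store the five highest-scoring classes in column D.
            new_col1.append(str(classifier.classes_[score.argsort()[-5:]]))
            new_col2.append(np.nan)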

df['D'] = new_col1
df['E'] = new_col2

The Issue

I am concerned that groupby will not iterate in the top-down, order-of-appearance manner that I expect. Iteration order when sort=False is not covered in the docs.

My Expectations

All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.

Here is the code I used to test my theory on sort=False iteration order:

from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers

df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                   'B': randint(10, size=100)})

print(df.A.unique())  # unique values in order of appearance per the docs

for name, group in df.groupby('A', sort=False):
    print(name)

Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
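
Separately from the ordering question, here is a minimal sketch of a variant that does not depend on iteration order at all: each group's result keeps the group's row labels, and the final assignment aligns on the index. It only covers the "old class matches the prediction" column. The data, `fake_predict`, and `matched_classes` are hypothetical stand-ins for the real frame and for `classifier.predict(vectorize(...))`, and the sketch assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, deliberately NOT sorted by 'A'.
df = pd.DataFrame({'A': ['x', 'y', 'x', 'z', 'y'],
                   'B': ['t0', 't1', 't2', 't3', 't4'],
                   'C': ['1000', '2000', '3000', '1000', '2000']})


def fake_predict(texts):
    # Hypothetical placeholder for classifier.predict(vectorize(texts)):
    # it predicts class '1000' for every row.
    return np.array(['1000'] * len(texts), dtype=object)


def matched_classes(frame, sort):
    parts = []
    for name, group in frame.groupby('A', sort=sort):
        preds = fake_predict(group['B'].to_numpy())
        agrees = preds == group['C'].to_numpy()
        # Keep the old class where the prediction agrees, NaN elsewhere.
        # The result still carries the group's original row labels.
        parts.append(group['C'].where(agrees))
    return pd.concat(parts)


# Column assignment aligns on the index, so group order is irrelevant.
df['E'] = matched_classes(df, sort=False)
print(df)
```

Because the assignment aligns on row labels, the same column comes out whether the groups are visited in sorted order or in order of first appearance, and the up-front `sort_values` is no longer needed for correctness.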

Eric Ed Lohmar
  • We'd like to see your actual data and expected output. You've too much text here. – cs95 Nov 08 '17 at 17:23
  • Since the question is about the operation of the `groupby` function specifically, I don't see how actual data is relevant. I have simplified the text in the question and added notes regarding what the data looks like. – Eric Ed Lohmar Nov 08 '17 at 17:32

2 Answers

Yes, when you pass sort=False the order of first appearance is preserved. The groupby source code is a little opaque, but there is one function groupby.ngroup which fully answers this question, as it directly tells you the order in which iteration occurs.

def ngroup(self, ascending=True):
    """
    Number each group from 0 to the number of groups - 1.
    This is the enumerative complement of cumcount.  Note that the
    numbers given to the groups match the order in which the groups
    would be seen when iterating over the groupby object, not the
    order they are first observed.
    """

Data from @coldspeed

df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()

Output:

    col  sort=False  sort=True
0   16           0          7
1    1           1          0
2   10           2          5
3   20           3          8
4    3           4          2
5   13           5          6
6    2           6          1
7    5           7          3
8    7           8          4

When sort=False, you iterate over the groups in order of first appearance; when sort=True, the group keys are sorted first and then iterated.
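
To reproduce the table above and tie the `ngroup` labels back to the actual iteration, here is a small self-contained check. It assumes the same nine values shown in the output; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': [16, 1, 10, 20, 3, 13, 2, 5, 7]})

# Group labels under both settings, kept as separate Series.
labels_unsorted = df.groupby('col', sort=False).ngroup()
labels_sorted = df.groupby('col', sort=True).ngroup()

# The order in which the groups are actually yielded with sort=False.
iteration_order = [name for name, _ in df.groupby('col', sort=False)]

# Map each key to its position in the iteration and compare with ngroup.
label_of = {name: i for i, name in enumerate(iteration_order)}
assert labels_unsorted.tolist() == [label_of[v] for v in df['col']]

print(labels_unsorted.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(labels_sorted.tolist())    # [7, 0, 5, 8, 2, 6, 1, 3, 4]
```

The assertion passing means the `ngroup` number of every row equals the position at which its group is yielded during iteration, which is what the docstring promises.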

ALollz

Let's do a little empirical test. You can iterate over groupby and see the order in which groups are iterated over.

df

   col
0   16
1    1
2   10
3   20
4    3
5   13
6    2
7    5
8    7

for c, g in df.groupby('col', sort=False):
    print(c)

16
1
10
20
3
13
2
5
7

It appears that the order is preserved.
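
The example above has no repeated keys, so here is a slightly stricter sketch with repeats, using assertions instead of eyeballing printed output. The column name `A` and the sample keys are made up for illustration; this is still an empirical check on one input, not a proof.

```python
import pandas as pd

# Repeated, unsorted keys: 'c' appears first, then 'a', then 'b'.
df = pd.DataFrame({'A': ['c', 'a', 'c', 'b', 'a', 'b', 'c'],
                   'B': range(7)})

seen = [name for name, _ in df.groupby('A', sort=False)]

# Between groups: iteration follows the order of first appearance.
assert seen == list(df['A'].unique())

# Within groups: rows keep their original top-down order.
for _, group in df.groupby('A', sort=False):
    assert group.index.is_monotonic_increasing

print(seen)  # ['c', 'a', 'b']
```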

cs95
  • My concern is the term *appears* in your answer. If you iterate through a `set` several times, it can, coincidentally, run in the same order, creating the *appearance* that it is consistent, but it cannot be relied on. I'm looking for some sort of proof or documentation that it performs, undeniably, in the manner that I expect. Since I'm 85% sure this question was fueled by caffeine-driven paranoia and you were kind enough to answer anyway, I will accept your answer if there is not a better one this time tomorrow. – Eric Ed Lohmar Nov 08 '17 at 18:07
  • I am also going to edit my question to make my expectations more clear. – Eric Ed Lohmar Nov 08 '17 at 18:08
  • @EricEdLohmar Based on this https://github.com/pandas-dev/pandas/issues/8588, it seems they added the feature to preserve the order, so yes, it is preserved. – cs95 Nov 08 '17 at 18:09
  • Note that I believe it is referring to order between groups, not within groups. – cs95 Nov 08 '17 at 18:12
  • That document leads to [this SO question](https://stackoverflow.com/q/26456125/2000793), which specifically states order within the groups, not order between groups. – Eric Ed Lohmar Nov 08 '17 at 18:14
  • @EricEdLohmar Ah, afraid I'm not sure of the latter. – cs95 Nov 08 '17 at 18:15