In Python, split array rows into groups according to the value in a specific column of that array

Question

I have an array where each row of data follows a sequential order, identified by a label column at the end. As a small example, its format is similar to this:

arr = [[1,2,3,1], 
       [2,3,4,1],
       [3,4,5,1],
       [4,5,6,2],
       [5,6,7,2],
       [7,8,9,2],
       [9,10,11,3]]

I would like to split the array into groups using the label column as the group-by marker. So the above array would produce 3 arrays:

arrA = [[1,2,3,1], 
        [2,3,4,1],
        [3,4,5,1]]

arrB = [[4,5,6,2],
        [5,6,7,2],
        [7,8,9,2]]

arrC = [[9,10,11,3]]

I currently use this for loop, which stores each group in a list called wins:

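import numpy as np
arr = np.asarray(arr)  # the indexing below needs an ndarray, not a nested list
wins = []
# One boolean-mask pass per label; assumes labels run 1..arr[-1, 3].
for w in range(1, arr[-1, 3] + 1):
    wins.append(arr[arr[:, 3] == w, :])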

This does the job, but I have several large datasets to process. Is there a vectorized way of doing this, perhaps using diff() or where() from the NumPy library?

@Georgy, the answer in the link you gave works for me with a simple index modification, thanks. So I will mark my question as duplicate — humbleHacker, Aug 20 '19 at 09:46

score 0 · Answer 1 · Baptiste Vauthey · 2019-08-19T20:32:34.990

I think this piece of code would be more than fast enough with any dataset that's not absolutely massive:

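wins = []  # wins[k] collects the rows whose label is k
for a in arr:
    while True:
        try:
            wins[a[-1]].append(a)
            break
        except IndexError:
            # Label is past the end of the list: grow it and retry.
            wins.append([])
# Labels start at 1 in the example, so wins[0] stays empty.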

You definitely won't get anything better than O(n), since every row has to be looked at at least once. If your data is stored somewhere else, such as a SQL database, you would probably be better off running this logic in the SQL query itself.

edited Aug 19 '19 at 20:32

answered Aug 19 '19 at 18:30

This code doesn't actually work -- you want the break statement after the try (otherwise it'll break without appending the item in the instance a[-1] isn't a valid index)! – Ravenlocke Aug 19 '19 at 19:05
@Ravenlocke You're right. Edited – Baptiste Vauthey Aug 19 '19 at 20:32

score 0 · Answer 2 · answered Aug 19 '19 at 19:01

Okay, I did some more digging using the search term "numpy group by" (thanks to the commenter who has since removed their comment) and found this very similar question: "Is there any numpy group by function?"

I adapted the answer from Vincent J (https://stackoverflow.com/users/1488055/vincent-j) to this and it produced the correct result:

wins = np.split(arr, np.cumsum(np.unique(arr[:, 3], return_counts=True)[1])[:-1])

I will go with this code but by all means chip in if anyone thinks there's a better way.
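For anyone reusing the one-liner above, here is a minimal, self-contained sketch of the same idea. It is not taken from the linked answer. The helper names split_by_label and split_on_change are made up for illustration. The one-liner relies on the rows already being sorted by label, so the first helper sorts them first (an added assumption that the grouped output order is acceptable). The second helper is a guess at the diff()-based variant the question asks about, and it only makes sense when equal labels are already contiguous.

```python
import numpy as np

arr = np.array([[1, 2, 3, 1],
                [2, 3, 4, 1],
                [3, 4, 5, 1],
                [4, 5, 6, 2],
                [5, 6, 7, 2],
                [7, 8, 9, 2],
                [9, 10, 11, 3]])


def split_by_label(a, col=-1):
    """Split the rows of `a` into one array per distinct value in column `col`."""
    # Stable sort so rows keep their original relative order inside each group.
    order = np.argsort(a[:, col], kind="stable")
    a_sorted = a[order]
    # The per-label counts are the group sizes; their running sum gives the
    # split points (the last one equals len(a), so it is dropped).
    _, counts = np.unique(a_sorted[:, col], return_counts=True)
    return np.split(a_sorted, np.cumsum(counts)[:-1])


def split_on_change(a, col=-1):
    """Split wherever the label changes between consecutive rows (no sorting)."""
    # np.diff is non-zero at each change; adding 1 turns that position into
    # the index of the first row of the next group.
    boundaries = np.flatnonzero(np.diff(a[:, col])) + 1
    return np.split(a, boundaries)


wins = split_by_label(arr)
```

Both helpers return a list of 2-D arrays, so a group with a single row still comes back with shape (1, 4) rather than as a flat row.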

score 0 · Answer 3 · answered Aug 19 '19 at 19:10

I know you asked for arrays, but for what you are describing a dict is possibly an easier way to approach this.

from collections import defaultdict

wins = defaultdict(list)

for item in arr:
    wins[item[-1]].append(item)

The separate groups you want are then the values in wins (e.g., wins[1] is a list of the rows whose label is 1).

Just seems a little more Pythonic and readable to me!
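If NumPy arrays are still wanted at the end, the dict approach can be combined with a final stacking step. The sketch below assumes arr is a NumPy array shaped like the one in the question. The names groups and the int() cast on the label are illustrative choices, not part of the answer above.

```python
from collections import defaultdict

import numpy as np

arr = np.array([[1, 2, 3, 1],
                [2, 3, 4, 1],
                [3, 4, 5, 1],
                [4, 5, 6, 2],
                [5, 6, 7, 2],
                [7, 8, 9, 2],
                [9, 10, 11, 3]])

# Single pass over the rows: collect each row under its label.
groups = defaultdict(list)
for row in arr:
    # int() turns the NumPy scalar into a plain Python int for use as a key.
    groups[int(row[-1])].append(row)

# Stack each list of rows back into a 2-D array, keyed by label.
wins = {label: np.vstack(rows) for label, rows in groups.items()}
```

Unlike the list-based versions, this does not need the labels to be sorted, contiguous, or to start at 1.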
