3

I have an array where each row of data follows a sequential order, identified by a label column at the end. As a small example, its format is similar to this:

import numpy as np

arr = np.array([[1,2,3,1],
                [2,3,4,1],
                [3,4,5,1],
                [4,5,6,2],
                [5,6,7,2],
                [7,8,9,2],
                [9,10,11,3]])

I would like to split the array into groups using the label column as the group-by marker. So the above array would produce 3 arrays:

arrA = [[1,2,3,1], 
        [2,3,4,1],
        [3,4,5,1]]

arrB = [[4,5,6,2],
        [5,6,7,2],
        [7,8,9,2]]

arrC = [[9,10,11,3]]

I currently have this FOR loop, storing each group array in a wins list:

wins = []
for w in range(1, arr[-1,3]+1):
    wins.append(arr[arr[:, 3] == w, :]) 

This does the job, but I have several large datasets to process. Is there a vectorized way of doing this, perhaps using diff() or where() from the NumPy library?

humbleHacker
  • @Georgy, the answer in the link you gave works for me with a simple index modification, thanks. So I will mark my question as duplicate – humbleHacker Aug 20 '19 at 09:46

3 Answers

0

I think this piece of code would be more than fast enough with any dataset that's not absolutely massive:

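wins = []
for a in arr:
    # Labels start at 1, so shift by one to get a 0-based list index.
    idx = int(a[-1]) - 1
    # Grow the list until a slot for this label exists.
    while idx >= len(wins):
        wins.append([])
    wins[idx].append(a)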

You definitely won't get anything better than O(n), since every row has to be visited at least once. If your data is stored somewhere else, such as a SQL database, you'd probably be better off doing the grouping in the SQL query itself.

0

Okay, I did some more digging using "numpy group by" as the search term (thanks to the commenter who has since removed their comment) and found this very similar question: "Is there any numpy group by function?"

I adapted the answer from Vincent J (https://stackoverflow.com/users/1488055/vincent-j) to this and it produced the correct result:

# Assumes the rows are already sorted by the label column (column 3).
wins = np.split(arr, np.cumsum(np.unique(arr[:, 3], return_counts=True)[1])[:-1])

I will go with this code but by all means chip in if anyone thinks there's a better way.
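For comparison, here is a minimal sketch of the diff()-based variant that the question hints at. It is only an illustration, not a benchmarked recommendation: the helper name split_by_label is made up for this example, and it assumes rows with the same label are already contiguous, as in the sample data.

```python
import numpy as np

arr = np.array([[1, 2, 3, 1],
                [2, 3, 4, 1],
                [3, 4, 5, 1],
                [4, 5, 6, 2],
                [5, 6, 7, 2],
                [7, 8, 9, 2],
                [9, 10, 11, 3]])


def split_by_label(a, col=-1):
    """Split the rows of `a` into consecutive groups that share a label.

    Hypothetical helper; assumes equal labels are contiguous in `a`.
    """
    labels = a[:, col]
    # A non-zero difference means the label changed between two adjacent
    # rows, so the row after it starts a new group.
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    return np.split(a, boundaries)


wins = split_by_label(arr)
```

Unlike the np.unique version, this does not need the labels to be in ascending order, only grouped together, and it avoids the sort that np.unique performs internally.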

humbleHacker
0

I know you asked for arrays, but for what you're describing a dict may be an easier way to approach this:

from collections import defaultdict

wins = defaultdict(list)

for item in arr:
    wins[item[-1]].append(item) 

The separate groups you want are then the values in wins (e.g., wins[1] is a list of the rows whose label is 1).
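If 2-D arrays are needed at the end rather than lists of rows, each group can be stacked back together. This is a rough sketch assuming arr is a NumPy array like the sample in the question; the intermediate name groups is only illustrative.

```python
from collections import defaultdict

import numpy as np

arr = np.array([[1, 2, 3, 1],
                [2, 3, 4, 1],
                [3, 4, 5, 1],
                [4, 5, 6, 2],
                [5, 6, 7, 2],
                [7, 8, 9, 2],
                [9, 10, 11, 3]])

groups = defaultdict(list)
for row in arr:
    # int() turns the NumPy scalar label into a plain Python dict key.
    groups[int(row[-1])].append(row)

# Stack each group's 1-D rows back into one 2-D array per label.
wins = {label: np.vstack(rows) for label, rows in groups.items()}
```

Here wins[2] is a 3x4 array holding the three rows labelled 2, and the rows do not have to be sorted or contiguous for this to work.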

Just seems a little more Pythonic and readable to me!

Ravenlocke