
This is a follow-up to the post "How to extract rows from a numpy array based on the content?". I used the following code to split rows based on the content of the second column:

np.split(sorted_a,np.unique(sorted_a[:,1],return_index=True)[1][1:])

The code worked fine at first, but when I later used it to split other cases (as below), I found that it can give wrong results (as shown in CASE#1).

CASE#1
[[2748309, 246211, 1],
 [2748309, 246211, 2],
 [2747481, 246201, 54]]
OUTPUT#1
[]
[[2748309, 246211, 1],
 [2748309, 246211, 2],
 [2747481, 246201, 54]]
the result I want
[[2748309, 246211, 1],
 [2748309, 246211, 2]]
[[2747481, 246201, 54]]

I suspect the code only splits rows correctly when the numbers are small (have few digits), and I don't know how to solve the problem shown in CASE#1 above. So in this post I have two related questions:

1. How can I split rows that contain larger numbers (as shown in CASE#1)?

2. How can I split data that contains both (a) rows with the same element in the second column but different elements in the first, and (b) rows with the same element in the first column but different elements in the second? (That is, can Python distinguish rows by considering the contents of the first and second columns simultaneously?)

Feel free to give me suggestions, thank you.

Update#1

The ravel_multi_index function can handle this kind of task for integer arrays, but how should arrays containing floats be handled?

Heinz

3 Answers


Here's an approach that treats the pair of elements in the first two columns of each row as an indexing tuple -

# Convert to linear index equivalents
lidx = np.ravel_multi_index(arr[:,:2].T,arr[:,:2].max(0)+1)

# Get the indices that sort lidx, then find where each new group starts
# in the sorted order and split the sorted array along axis=0 there.
sidx = lidx.argsort()
out = np.split(arr[sidx],np.unique(lidx[sidx],return_index=True)[1][1:])

Sample run -

In [34]: arr
Out[34]: 
array([[2, 7, 5],
       [3, 4, 6],
       [2, 3, 5],
       [2, 7, 7],
       [4, 4, 7],
       [3, 4, 6],
       [2, 8, 5]])

In [35]: out
Out[35]: 
[array([[2, 3, 5]]), array([[2, 7, 5],
        [2, 7, 7]]), array([[2, 8, 5]]), array([[3, 4, 6],
        [3, 4, 6]]), array([[4, 4, 7]])]

For details on treating a group of elements as an indexing tuple, please refer to this post.
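As a rough check against the data from the question, the same steps can be run on CASE#1. This is only a sketch: the variable names `keys`, `starts`, and `groups` are introduced here for illustration, and the stable sort is an added choice so that rows inside a group keep their original order. It assumes the key columns hold non-negative integers whose combined range fits in a platform integer; with very large values `np.ravel_multi_index` may reject the dimensions.

```python
import numpy as np

# Data from CASE#1 in the question.
arr = np.array([[2748309, 246211, 1],
                [2748309, 246211, 2],
                [2747481, 246201, 54]])

# Map each (column 0, column 1) pair to a single linear index.
keys = arr[:, :2]
lidx = np.ravel_multi_index(keys.T, keys.max(axis=0) + 1)

# Sort rows by that index; a stable sort keeps the original order within a group.
sidx = lidx.argsort(kind="stable")

# First position of each distinct index in the sorted order = start of each group.
starts = np.unique(lidx[sidx], return_index=True)[1]
groups = np.split(arr[sidx], starts[1:])

for g in groups:
    print(g)
```

The groups come out ordered by the (first column, second column) pair, so the row with 2747481 is printed before the two rows with 2748309.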

Divakar
  • Thank you for the suggestion and the detailed link. The ravel_multi_index function can handle an array of integers, but I am wondering how to do the same for an array of floats, because the function seems to work only with integers. – Heinz Sep 30 '16 at 06:36
  • @Heinz In the first step, when calculating `lidx`, use `np.unique(arr[:,:2],return_inverse=1)[1].reshape(-1,2)` in place of `arr[:,:2]`. – Divakar Sep 30 '16 at 08:33
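The suggestion in the comment above could be sketched as follows. The float data is made up for illustration, and the names `ids`, `starts`, and `groups` are not from the original answer. The idea is that `np.unique(..., return_inverse=True)` replaces every float in the key columns with the integer position of its value among the sorted unique values, which `np.ravel_multi_index` can then accept. Note that this groups on exact float equality, so values that differ only by rounding error would end up in different groups.

```python
import numpy as np

# Hypothetical float data; the first two columns form the grouping key.
arr = np.array([[0.5, 1.25, 10.0],
                [2.5, 1.25, 20.0],
                [0.5, 1.25, 30.0],
                [0.5, 3.75, 40.0]])

# Replace each float in the key columns by an integer ID
# (its position among the sorted unique values).
ids = np.unique(arr[:, :2], return_inverse=True)[1].reshape(-1, 2)

# From here on, the steps are the same as for integer keys.
lidx = np.ravel_multi_index(ids.T, ids.max(axis=0) + 1)
sidx = lidx.argsort(kind="stable")
starts = np.unique(lidx[sidx], return_index=True)[1]
groups = np.split(arr[sidx], starts[1:])

for g in groups:
    print(g)
```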

The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform these types of operations:

import numpy_indexed as npi
npi.group_by(a[:, :2]).split(a)

It has decent test coverage, so I'd be surprised if it tripped on your seemingly straightforward test case.

Eelco Hoogendoorn
  • Thank you for the answer. I will download and test the numpy_indexed package, but I would prefer to solve this problem with just Python and NumPy. Anyway, thank you. – Heinz Sep 29 '16 at 14:37

If I apply that split line directly to your array, I get your result: an empty array plus the original.

In [136]: np.split(a,np.unique(a[:,1],return_index=True)[1][1:])
Out[136]: 
[array([], shape=(0, 3), dtype=int32), 
 array([[2748309,  246211,       1],
        [2748309,  246211,       2],
        [2747481,  246201,      54]])]

But if I first sort the array on the 2nd column, as specified in the linked answer, I get the desired answer, with the 2 arrays switched:

In [141]: sorted_a=a[np.argsort(a[:,1])]
In [142]: sorted_a
Out[142]: 
array([[2747481,  246201,      54],
       [2748309,  246211,       1],
       [2748309,  246211,       2]])
In [143]: np.split(sorted_a,np.unique(sorted_a[:,1],return_index=True)[1][1:])
Out[143]: 
[array([[2747481,  246201,      54]]), 
 array([[2748309,  246211,       1],
        [2748309,  246211,       2]])]
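The reason the unsorted call misbehaves is that `np.unique` reports first-occurrence indices in the order of the sorted unique values, so they are usable as split points only when the column is already sorted. A minimal sketch that wraps the sort and the split together is shown below; `split_by_column` is a hypothetical helper name, not a NumPy function, and the stable sort is an added choice to keep rows in their original relative order within each group.

```python
import numpy as np

def split_by_column(a, col=1):
    """Split the rows of `a` into groups sharing the same value in column `col`."""
    # Sort rows by the key column; stable keeps the original order within a group.
    order = np.argsort(a[:, col], kind="stable")
    sorted_a = a[order]
    # Index of the first row of each distinct value in the sorted column.
    starts = np.unique(sorted_a[:, col], return_index=True)[1]
    return np.split(sorted_a, starts[1:])

a = np.array([[2748309, 246211, 1],
              [2748309, 246211, 2],
              [2747481, 246201, 54]])

# On the unsorted column the indices are [2, 0]; dropping the first leaves [0],
# and splitting at 0 yields an empty piece followed by the whole array.
unsorted_starts = np.unique(a[:, 1], return_index=True)[1]
print(unsorted_starts)

groups = split_by_column(a)
for g in groups:
    print(g)
```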
hpaulj