
I have the fake data below. After reading it into an array it has shape (8, 3). Now I want to split the data based on the first column (the ID) and get back a list of arrays whose shapes will be [(3, 3), (2, 3), (3, 3)]. I think np.split could do the job by passing a 1-D array of split indices as the "indices_or_sections" argument (sketched below the data). But is there a more convenient way to do this?

1   700 35
1   700 35
1   700 35
2   680 25
2   680 25
3   750 40
3   750 40
3   750 40
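
For reference, here is the manual np.split route I mean, with the split indices hard-coded for this sample (finding those indices automatically is the part I'd like to be more convenient):

>>> import numpy as np
>>> data = np.array([[1, 700, 35], [1, 700, 35], [1, 700, 35],
...                  [2, 680, 25], [2, 680, 25],
...                  [3, 750, 40], [3, 750, 40], [3, 750, 40]])
>>> # split before rows 3 and 5, i.e. wherever a new ID starts
>>> [part.shape for part in np.split(data, [3, 5])]
[(3, 3), (2, 3), (3, 3)]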
wwj123
  • are you open to non numpy solutions? – JacobIRR Oct 04 '19 at 02:30
  • If values in the 1st column are continuous? – BAKE ZQ Oct 04 '19 at 02:51
  • `np.split(a,np.flatnonzero(np.diff(a[:,0]))+1)` is about as convenient as it gets (expanded into a runnable sketch below these comments). – Paul Panzer Oct 04 '19 at 03:05
  • 1
    This may be a duplicate: https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function – Cory Nezin Oct 04 '19 at 03:32
  • NumPy arrays usually contain data that is all of the same type, and all measuring the same thing. It looks a lot like [`pandas`](https://pandas.pydata.org/) will help you, as it's explicitly designed for handling columnar data like this. – Matt Hall Oct 04 '19 at 14:00
  • Possible duplicate of [Is there any numpy group by function?](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function) – Matt Hall Oct 04 '19 at 14:01
  • Yes https://stackoverflow.com/a/53859634/7554103 in the post should work. Thanks, CoryNezin kwinkunks – wwj123 Oct 05 '19 at 01:14
  • @JacobIRR I guess pandas could also work, but I am going to work on large-scale data, so numpy would probably be faster? – wwj123 Oct 05 '19 at 01:15
  • @kwinkunks I guess pandas could also work, but I am going to work on large-scale data, so numpy would probably be faster? – wwj123 Oct 05 '19 at 01:15
  • @BAKEZQ no they could be any random numbers. – wwj123 Oct 05 '19 at 01:17
  • @PaulPanzer yes, I think you are correct. – wwj123 Oct 05 '19 at 01:21
  • @CarlosWen Doubtful: Pandas uses NumPy arrays to represent its data. If you need more horsepower (e.g. have huge dataframes), there's [`dask`](https://docs.dask.org/en/latest/dataframe.html) or [`vaex`](https://github.com/vaexio/vaex), both of which work out-of-core (i.e. the dataframe doesn't have to fit in memory). – Matt Hall Oct 05 '19 at 12:54
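
For the record, Paul Panzer's one-liner expands to something like the sketch below. The sort step is an addition here, since (per the comments) equal IDs are not guaranteed to sit in consecutive rows, and the diff trick only splits correctly when they do:

>>> import numpy as np
>>> a = np.array([[1, 700, 35], [1, 700, 35], [1, 700, 35],
...               [2, 680, 25], [2, 680, 25],
...               [3, 750, 40], [3, 750, 40], [3, 750, 40]])
>>> a = a[a[:, 0].argsort()]   # make equal IDs contiguous first
>>> # np.diff is non-zero exactly where the ID changes; +1 turns those
>>> # positions into split points for np.split
>>> [g.shape for g in np.split(a, np.flatnonzero(np.diff(a[:, 0])) + 1)]
[(3, 3), (2, 3), (3, 3)]

Matt Hall's pandas suggestion could look roughly like this (groupby sorts by key itself, so no explicit sort is needed):

>>> import pandas as pd
>>> [g.to_numpy() for _, g in pd.DataFrame(a).groupby(0)]  # one array per ID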

1 Answer


You can achieve this with a combination of np.split, argsort, np.unique, and np.cumsum.

>>> import numpy as np
>>> a = [[1, 700, 35],
...      [1, 700, 35],
...      [1, 700, 35],
...      [2, 680, 25],
...      [2, 680, 25],
...      [3, 750, 40],
...      [3, 750, 40],
...      [3, 750, 40]]
>>> a = np.array(a)
>>> # sort the array by its first column so that equal IDs are contiguous
>>> a = a[a[:,0].argsort()]
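>>> # np.unique(..., return_counts=True)[1] gives the size of each ID group;
>>> # np.cumsum turns those sizes into split points (the last one is dropped)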
>>> np.split(a, np.cumsum(np.unique(a[:, 0], return_counts=True)[1])[:-1])
[array([[  1, 700,  35],
       [  1, 700,  35],
       [  1, 700,  35]]), array([[  2, 680,  25],
       [  2, 680,  25]]), array([[  3, 750,  40],
       [  3, 750,  40],
       [  3, 750,  40]])]
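
As a quick check, the pieces come back with exactly the shapes the question asked for:

>>> groups = np.split(a, np.cumsum(np.unique(a[:, 0], return_counts=True)[1])[:-1])
>>> [g.shape for g in groups]
[(3, 3), (2, 3), (3, 3)]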
Anirudh