
I have the fake data below. After reading it into an array it has shape (8, 3). Now I want to split the data based on the first column (the ID) and get back a list of arrays whose shapes will be [(3, 3), (2, 3), (3, 3)]. I think np.split could do the job by passing a 1-D array of split indices as the "indices_or_sections" argument (sketched below the data). But is there a more convenient way to do this?

1   700 35
1   700 35
1   700 35
2   680 25
2   680 25
3   750 40
3   750 40
3   750 40
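
For reference, here is the manual np.split route I mean, with the split indices hard-coded for this sample (finding those indices automatically is the part I'd like to be more convenient):

>>> import numpy as np
>>> data = np.array([[1, 700, 35], [1, 700, 35], [1, 700, 35],
...                  [2, 680, 25], [2, 680, 25],
...                  [3, 750, 40], [3, 750, 40], [3, 750, 40]])
>>> # split before rows 3 and 5, i.e. wherever a new ID starts
>>> [part.shape for part in np.split(data, [3, 5])]
[(3, 3), (2, 3), (3, 3)]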
wwj123
  • are you open to non numpy solutions? – JacobIRR Oct 04 '19 at 02:30
  • If values in the 1st column are continuous? – BAKE ZQ Oct 04 '19 at 02:51
  • `np.split(a,np.flatnonzero(np.diff(a[:,0]))+1)` is about as convenient as it gets (expanded into a runnable sketch below these comments). – Paul Panzer Oct 04 '19 at 03:05
  • 1
    This may be a duplicate: https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function – Cory Nezin Oct 04 '19 at 03:32
  • NumPy arrays usually contain data that is all of the same type, and all measuring the same thing. It looks a lot like [`pandas`](https://pandas.pydata.org/) will help you, as it's explicitly designed for handling columnar data like this. – Matt Hall Oct 04 '19 at 14:00
  • Possible duplicate of [Is there any numpy group by function?](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function) – Matt Hall Oct 04 '19 at 14:01
  • Yes https://stackoverflow.com/a/53859634/7554103 in the post should work. Thanks, CoryNezin kwinkunks – wwj123 Oct 05 '19 at 01:14
  • @JacobIRR I guess pandas could also work, but I am going to work on large-scale data, so numpy would probably be faster? – wwj123 Oct 05 '19 at 01:15
  • @kwinkunks I guess pandas could also work, but I am going to work on large-scale data, so numpy would probably be faster? – wwj123 Oct 05 '19 at 01:15
  • @BAKEZQ no they could be any random numbers. – wwj123 Oct 05 '19 at 01:17
  • @PaulPanzer yes, I think you are correct. – wwj123 Oct 05 '19 at 01:21
  • @CarlosWen Doubtful: Pandas uses NumPy arrays to represent its data. If you need more horsepower (e.g. have huge dataframes), there's [`dask`](https://docs.dask.org/en/latest/dataframe.html) or [`vaex`](https://github.com/vaexio/vaex), both of which work out-of-core (i.e. the dataframe doesn't have to fit in memory). – Matt Hall Oct 05 '19 at 12:54
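
For the record, Paul Panzer's one-liner expands to something like the sketch below. The sort step is an addition here, since (per the comments) equal IDs are not guaranteed to sit in consecutive rows, and the diff trick only splits correctly when they do:

>>> import numpy as np
>>> a = np.array([[1, 700, 35], [1, 700, 35], [1, 700, 35],
...               [2, 680, 25], [2, 680, 25],
...               [3, 750, 40], [3, 750, 40], [3, 750, 40]])
>>> a = a[a[:, 0].argsort()]   # make equal IDs contiguous first
>>> # np.diff is non-zero exactly where the ID changes; +1 turns those
>>> # positions into split points for np.split
>>> [g.shape for g in np.split(a, np.flatnonzero(np.diff(a[:, 0])) + 1)]
[(3, 3), (2, 3), (3, 3)]

Matt Hall's pandas suggestion could look roughly like this (groupby sorts by key itself, so no explicit sort is needed):

>>> import pandas as pd
>>> [g.to_numpy() for _, g in pd.DataFrame(a).groupby(0)]  # one array per ID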

1 Answer


You can achieve this with a combination of np.split, argsort, np.unique, and np.cumsum.

>>> import numpy as np
>>> a = [[1, 700, 35],
...      [1, 700, 35],
...      [1, 700, 35],
...      [2, 680, 25],
...      [2, 680, 25],
...      [3, 750, 40],
...      [3, 750, 40],
...      [3, 750, 40]]
>>> a = np.array(a)
>>> # sort the array by its first column so that equal IDs are contiguous
>>> a = a[a[:,0].argsort()]
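>>> # np.unique(..., return_counts=True)[1] gives the size of each ID group;
>>> # np.cumsum turns those sizes into split points (the last one is dropped)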
>>> np.split(a, np.cumsum(np.unique(a[:, 0], return_counts=True)[1])[:-1])
[array([[  1, 700,  35],
       [  1, 700,  35],
       [  1, 700,  35]]), array([[  2, 680,  25],
       [  2, 680,  25]]), array([[  3, 750,  40],
       [  3, 750,  40],
       [  3, 750,  40]])]
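
As a quick check, the pieces come back with exactly the shapes the question asked for:

>>> groups = np.split(a, np.cumsum(np.unique(a[:, 0], return_counts=True)[1])[:-1])
>>> [g.shape for g in groups]
[(3, 3), (2, 3), (3, 3)]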
Anirudh