I have an array in the following form where the first two columns are supposed to be indices of a 2-dimensional array and the following columns are arbitrary values.

data = np.array([[ 0. ,  1. , 48. ,  4. ],
                 [ 1. ,  2. , 44. ,  4.4],
                 [ 1. ,  1. , 34. ,  2.3],
                 [ 0. ,  2. , 55. ,  2.2],
                 [ 0. ,  0. , 42. ,  2. ],
                 [ 1. ,  0. , 22. ,  1. ]])

How do I combine the indices data[:,:2] with their values data[:,2:] such that the resulting array is accessible by the indices in the first two columns?

In my example that would be:

result = np.array([[[42. ,  2. ], [48. ,  4. ], [55. ,  2.2]],
                   [[22. ,  1. ], [34. ,  2.3], [44. ,  4.4]]])

I know that there is a trivial solution using python loops. But performance is a concern since I'm dealing with a huge amount of data. Specifically it's output of another program that I need to process.
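For reference, a minimal sketch of the loop-based approach mentioned above. The function name `combine_with_loop` and its parameters `n_rows` and `n_cols` are made up for illustration; it assumes the highest indices are known beforehand, as stated below.

```python
import numpy as np

def combine_with_loop(data, n_rows, n_cols):
    """Place the value columns of each row at the position given by its first two columns."""
    depth = data.shape[1] - 2                      # number of value columns
    result = np.empty((n_rows, n_cols, depth), dtype=data.dtype)
    for row in data:                               # plain Python loop: simple but slow
        i, j = int(row[0]), int(row[1])            # indices are stored as floats
        result[i, j] = row[2:]
    return result

data = np.array([[0., 1., 48., 4. ],
                 [1., 2., 44., 4.4],
                 [1., 1., 34., 2.3],
                 [0., 2., 55., 2.2],
                 [0., 0., 42., 2. ],
                 [1., 0., 22., 1. ]])

result = combine_with_loop(data, n_rows=2, n_cols=3)
print(result[0, 1])  # the values stored for index (0, 1), i.e. 48 and 4
```

This works regardless of row order, but it iterates over every row in Python, which is the performance problem described above.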

Maybe there is a relatively trivial numpy solution as well. But I'm kind of stuck.

If it helps the following can be safely assumed:

  • All numbers in the first two columns are whole numbers (although the array consists of floats).
  • Every possible index (or rather combinations of indices) in the original array is used exactly once. I.e. there is guaranteed to be exactly one entry of the form [i, j, ...].
  • The indices start at 0 and I know the highest indices beforehand.

Edit:

Hmm. I see now how my example is misleading. The truth is that some of my input arrays are sorted, but that's unreliable. So I shouldn't assume anything about the order. I reordered some rows in my example to make it clearer. In case anyone wants to make sense of the answer and comment below: In my original question the array appeared to be sorted by the first two columns.

Scindix

2 Answers

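Determine the number of rows, columns, and the depth from your data array, then reshape the value columns as shown below. This relies on the rows being sorted by the first two columns: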

import numpy as np
data = np.array([[ 0. ,  0. , 42. ,  2. ],
                 [ 0. ,  1. , 48. ,  4. ],
                 [ 0. ,  2. , 55. ,  2.2],
                 [ 1. ,  0. , 22. ,  1. ],
                 [ 1. ,  1. , 34. ,  2.3],
                 [ 1. ,  2. , 44. ,  4.4]])

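# Result shape: highest index + 1 along each axis, plus the number of value columns.
row = int(data[:, 0].max()) + 1
col = int(data[:, 1].max()) + 1
depth = data.shape[1] - 2

# Reshaping is enough here because the rows are already sorted by (row, col).
out = data[:, 2:].reshape(row, col, depth)
print(out)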

Output:

[[[42.   2. ]
  [48.   4. ]
  [55.   2.2]]

 [[22.   1. ]
  [34.   2.3]
  [44.   4.4]]]
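If the rows are not sorted, the reshape above puts values in the wrong places. One possible alternative is sketched below, using NumPy integer-array indexing to scatter the values directly. It is not part of the original answer; it assumes every (row, col) pair occurs exactly once, as the question states, and the variable names are illustrative.

```python
import numpy as np

# Unsorted input, as in the edited question.
data = np.array([[0., 1., 48., 4. ],
                 [1., 2., 44., 4.4],
                 [1., 1., 34., 2.3],
                 [0., 2., 55., 2.2],
                 [0., 0., 42., 2. ],
                 [1., 0., 22., 1. ]])

# Convert the index columns to integers so they can be used for indexing.
idx = data[:, :2].astype(np.intp)

# Allocate the output: highest index + 1 per axis, then the value columns.
shape = (idx[:, 0].max() + 1, idx[:, 1].max() + 1, data.shape[1] - 2)
out = np.empty(shape, dtype=data.dtype)

# Scatter: row k of the values goes to out[idx[k, 0], idx[k, 1]].
out[idx[:, 0], idx[:, 1]] = data[:, 2:]
print(out)
```

Because each index pair is assumed to be unique and complete, every cell of `out` is written exactly once, so `np.empty` is safe here. With missing pairs, `np.zeros` or `np.full` with a fill value would be the safer choice.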
I'mahdi
  • My input array isn't necessarily sorted (see my edit). I could of course sort the array first like so: https://stackoverflow.com/a/46230001/3139807 I thought there might be a more efficient way that doesn't require sorting. But I guess it still beats looping over the array by a lot. So I'm sticking with this solution. – Scindix Jul 02 '22 at 15:52
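A sketch of the sort-then-reshape idea from this comment, assuming `np.lexsort` is used for the sort (this may or may not match the linked answer). The variable names are illustrative.

```python
import numpy as np

data = np.array([[0., 1., 48., 4. ],
                 [1., 2., 44., 4.4],
                 [1., 1., 34., 2.3],
                 [0., 2., 55., 2.2],
                 [0., 0., 42., 2. ],
                 [1., 0., 22., 1. ]])

# np.lexsort treats the LAST key as the primary one, so this sorts by
# column 0 first and breaks ties with column 1.
order = np.lexsort((data[:, 1], data[:, 0]))
sorted_data = data[order]

n_rows = int(data[:, 0].max()) + 1
n_cols = int(data[:, 1].max()) + 1

# After sorting, the value columns are in row-major order and can be reshaped.
out = sorted_data[:, 2:].reshape(n_rows, n_cols, -1)
print(out)
```

The sort costs O(n log n), which is the overhead the comment hopes to avoid, but it stays entirely inside NumPy.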

You can use numba in no-python parallel mode with loops (numba is designed to accelerate Python loops). As szczesny mentioned in the comments, this is likely to be one of the most efficient methods in terms of performance, and it does not require sorting. The code below is written for exactly two value columns; it can be modified if that number varies:

import numba as nb
import numpy as np

# without signature --> @nb.njit(parallel=True)
@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    # int64 rather than int8, so that indices above 127 do not overflow
    data_ = data[:, :2].astype(np.int64)
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, 2))
    # each (row, col) pair occurs exactly once, so parallel writes never collide
    for i in nb.prange(data_.shape[0]):
        res[data_[i, 0], data_[i, 1], 0] = data[i, 2]
        res[data_[i, 0], data_[i, 1], 1] = data[i, 3]
    return res

Comparison with the proposed NumPy code, without the sorting step (horizontal axis: data.shape[0]):

[Image: performance comparison plot, not available]

A more general version that handles any number of value columns:

@nb.njit("float64[:, :, ::1](float64[:, ::1])", parallel=True)
def numba_(data):
    # int64 rather than int8, so that indices above 127 do not overflow
    data_ = data[:, :2].astype(np.int64)
    depth = data.shape[1] - 2  # number of value columns
    res = np.empty((data_[:, 0].max() + 1, data_[:, 1].max() + 1, depth))
    # each (row, col) pair occurs exactly once, so parallel writes never collide
    for i in nb.prange(data_.shape[0]):
        for j in range(depth):
            res[data_[i, 0], data_[i, 1], j] = data[i, j + 2]
    return res
Ali_Sh