9

Assuming I want have a numpy array of size (n,m) where n is very large, but with a lot of duplication, ie. 0:n1 are identical, n1:n2 are identical etc. (with n2%n1!=0, ie not regular intervals). Is there a way to store only one set of values for each of the duplicates while having a view of the entire array?

example:

unique_values = np.array([[1,1,1], [2,2,2] ,[3,3,3]]) #these are the values i want to store in memory
index_mapping = np.array([0,0,1,1,1,2,2]) # a mapping between index of array above, with array below

unique_values_view = np.array([[1,1,1],[1,1,1],[2,2,2],[2,2,2],[2,2,2], [3,3,3],[3,3,3]]) #this is how I want the view to look like for broadcasting reasons

I plan to multiply array(view) by some other array of size (1,m), and take the dot product of this product:

other_array1 = np.arange(unique_values.shape[1]).reshape(1,-1) # (1,m)
other_array2 = 2*np.ones((unique_values.shape[1],1)) # (m,1)
output = np.dot(unique_values_view * other_array1, other_array2).squeeze()

Output is a 1D array of length n.

Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
M.T
  • 4,917
  • 4
  • 33
  • 52
  • How do you plan to use the output? FYI : A view at the first stage, doesn't guarantee that there won't be forced-copy later on. – Divakar Jun 01 '18 at 07:56
  • @Divakar True. I plan to multiply with another array of shape `(1,m)`, then storing the dot product with another array. The main consideration is fitting the array into memory. I could do the last step in chunks if it copying is enforced – M.T Jun 01 '18 at 08:07
  • Could you add mcve for the entire process, i.e. sample for that (1,m) array and then the final desired output? – Divakar Jun 01 '18 at 08:13
  • That multiplication with `2` doesn't seem like a generic operation. Could you explain its significance? – Divakar Jun 04 '18 at 10:28
  • 1
    Or could add a bit more generic minimal sample, say `index_mapping` with bigger range of numbers and `unique_values` with random numbers? – Divakar Jun 04 '18 at 10:40
  • 1
    By "identical" do you mean they are integers? Or just very close floats? (i.e can you do bitwise equvalence) – Daniel F Jun 05 '18 at 11:05
  • @DanielF If you look at `unique_values_view`, the zeroth and first inner list/array are both from `unique_values[0]`, the second, third and fourth are from `unique_values[1]` etc. The values will typically be floats, but I don't see how the type would matter. I want to duplicate array subsets into one large array, which doesn't store copies of all the duplicates. – M.T Jun 05 '18 at 16:13
  • @Divakar If you are able to show a solution to this minimal example, I will upvote it. If you can show that it doesn't require significantly more memory than `unique_values` using some kind of memory profiling, I will accept it right away. – M.T Jun 05 '18 at 16:19
  • 2
    @M.T Yakym Pirozhenko 's soln seem to work for the listed minimal case. But again, I can't see what would be the generic case. – Divakar Jun 05 '18 at 17:39

2 Answers2

6

Based on your example, you can simply factor the index mapping through the computation to the very end:

output2 = np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]

assert (output == output2).all()
hilberts_drinking_problem
  • 11,322
  • 3
  • 22
  • 51
1

Your expression admits two significant optimizations:

  • do the indexing last
  • multiply other_array1 with other_array2 first and then with unique_values

Let's apply these optimizations:

>>> output_pp = (unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]

# check for correctness
>>> (output == output_pp).all()
True

# and compare it to @Yakym Pirozhenko's approach
>>> from timeit import timeit
>>> print("yp:", timeit("np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]", globals=globals()))
yp: 3.9105667411349714
>>> print("pp:", timeit("(unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]", globals=globals()))
pp: 2.2684884609188884

These optimizations are easy to spot if we observe two things:

(1) if A is an mxn-matrix and b is an n-vector then

A * b == A @ diag(b)
A.T * b[:, None] == diag(b) @ A.T

(2) if A is anmxn-matrix and I is a k-vector of integers from range(m) then

A[I] == onehot(I) @ A

onehot can be defined as

def onehot(I, m, dtype=int):
    out = np.zeros((I.size, m), dtype=dtype)
    out[np.arange(I.size), I] = 1
    return out

Using these facts and abbreviating uv, im, oa1 and oa2 we can write

uv[im] * oa1 @ oa2 == onehot(im) @ uv @ diag(oa1) @ oa2

The above optimizations are now simply a matter of choosing the best order for these matrix multiplications which is

onehot(im) @ (uv @ (diag(oa1) @ oa2))

Using (1) and (2) backwards on this we obtain the optimized expression from the beginning of this post.

Paul Panzer
  • 51,835
  • 3
  • 54
  • 99