Use information of two arrays to create a third one

Question

I have two NumPy arrays and want to create a third one from the information in both. Here is a simple example:

import numpy as np

have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])

# desired result
solution = np.array([[1, 1, 3, 4], [5, 5, 5, 8]])

The "use" array tells me how many times the first element of each row of "have" should appear at the start of the corresponding row in "solution". So the 2 in "use" means that the first row of "solution" should start with two 1s. Similarly, the 3 in "use" means that the second row should start with three 5s. The remaining elements of "have" should stay the same. It is important that the "use" array (or a NumPy array in general) drives this.

Do you have any ideas?

score 1 · Answer 1 · answered Apr 20 '21 at 06:13

If the arrays are small and performance is not an issue, you can do it with a simple list comprehension:

np.array([ [a[0]]*b[0]+list(a[b[0]:]) for a,b in zip(have,use)])
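To make the one-liner easier to follow, here is a minimal sketch that unrolls it into an explicit loop. The names rows, k, head and tail are only illustrative and do not appear in the original answer; the explicit int() conversion is an assumption added to keep the list repetition independent of NumPy scalar behaviour.

```python
import numpy as np

have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])

rows = []
for row, count in zip(have, use):
    k = int(count[0])        # how many leading slots receive the first value
    head = [row[0]] * k      # first element of the row, repeated k times
    tail = list(row[k:])     # remaining elements, unchanged
    rows.append(head + tail)

solution = np.array(rows)
print(solution)
```

Every row still ends up with its original length, because k elements are dropped from the front and k copies of the first element are put back.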

quantummind

Vvvvvv · Answer 2 · 2021-04-20T08:07:12.260

If performance matters, you can use np.apply_along_axis().

import numpy as np

have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])


def rep1st(arr):
    rep = arr[0]
    res = np.repeat(arr[1], rep)
    res = np.concatenate([res, arr[rep+1:]])
    return res


solution = np.apply_along_axis(rep1st, 1, np.concatenate([use, have], axis=1))

update:

As @hpaulj pointed out, the apply_along_axis method above is not as efficient as I expected; I had misunderstood how it works. Reference: "numpy np.apply_along_axis function speed up?".
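One way to see why it does not help with speed: as far as I understand, np.apply_along_axis still calls the Python function once for every 1-D slice, so the loop over rows is not avoided. A minimal sketch that counts the calls (counting_identity and call_log are hypothetical names used only for this illustration):

```python
import numpy as np

call_log = []


def counting_identity(row):
    # Hypothetical helper: records each invocation and returns the row unchanged.
    call_log.append(1)
    return row


data = np.zeros((1000, 5))
np.apply_along_axis(counting_identity, 1, data)

# Number of Python-level calls made for the 1000 rows.
print(len(call_log))
```

The call count should grow with the number of rows, which fits the timings for the tall arrays in the test.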

However, I ran some tests on the current methods:

import numpy as np
from timeit import timeit


def rep1st(arr):
    rep = arr[0]
    res = np.repeat(arr[1], rep)
    res = np.concatenate([res, arr[rep + 1:]])
    return res


def test(row, col, run):
    have = np.random.randint(0, 100, size=(row, col))
    use = np.random.randint(0, col, size=(row, 1))
    d = locals()
    d.update(globals())
    # method by me
    t1 = timeit("np.apply_along_axis(rep1st, 1, np.concatenate([use, have], axis=1))", number=run, globals=d)
    # method by @quantummind
    t2 = timeit("np.array([[a[0]] * b[0] + list(a[b[0]:]) for a, b in zip(have, use)])", number=run, globals=d)
    # method by @Amit Vikram Singh
    t3 = timeit(
        "np.where(np.repeat(np.arange(have.shape[1])[None, :], have.shape[0], axis=0) < use, have[:, 0:1], have)",
        number=run, globals=d
    )
    print(f"{t1:8.6f}, {t2:8.6f}, {t3:8.6f}")


test(1000, 10, 10)
test(100, 100, 10)
test(10, 1000, 10)

test(1000000, 10, 1)
test(100000, 100, 1)
test(10000, 1000, 1)
test(1000, 10000, 1)
test(100, 100000, 1)
test(10, 1000000, 1)

results (total seconds per test; columns: apply_along_axis, list comprehension, np.where):

0.062488, 0.028484, 0.000408
0.010787, 0.013811, 0.000270
0.001057, 0.009146, 0.000216

6.146863, 3.210017, 0.044232
0.585289, 1.186013, 0.034110
0.091086, 0.961570, 0.026294
0.039448, 0.917052, 0.022553
0.028719, 0.919377, 0.022751
0.035121, 1.027036, 0.025216

The results show that the second method proposed by @Amit Vikram Singh (the np.where version) is the fastest in every test, even when the arrays are huge.
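The timings only compare speed. As a sanity check, here is a small sketch that confirms the three methods produce the same array on random input. The seed and array sizes are arbitrary choices for this illustration, and int() is added around b[0] as a precaution.

```python
import numpy as np


def rep1st(arr):
    rep = arr[0]
    res = np.repeat(arr[1], rep)
    return np.concatenate([res, arr[rep + 1:]])


rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
have = rng.integers(0, 100, size=(50, 8))
use = rng.integers(0, 8, size=(50, 1))    # includes 0, i.e. "replace nothing"

# apply_along_axis method
r1 = np.apply_along_axis(rep1st, 1, np.concatenate([use, have], axis=1))
# list comprehension method
r2 = np.array([[a[0]] * int(b[0]) + list(a[int(b[0]):]) for a, b in zip(have, use)])
# np.where method
r3 = np.where(
    np.repeat(np.arange(have.shape[1])[None, :], have.shape[0], axis=0) < use,
    have[:, 0:1],
    have,
)

print(np.array_equal(r1, r2) and np.array_equal(r2, r3))
```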

Have you actually timed this? In my experience `apply...` is not a performance tool. — hpaulj, Apr 20 '21 at 06:51
@hpaulj, you are right, and thanks for your reply. I was mistaken about this. — Vvvvvv, Apr 20 '21 at 08:23

Amit Vikram Singh · Answer 3 · 2021-04-20T06:31:48.790

Simply iterate over the rows of have and replace the values based on use. Note that this modifies have in place.

Use:

for i in range(use.shape[0]):
    have[i, :use[i, 0]] = np.repeat(have[i, 0], use[i, 0])

Using only numpy operations:

First create a boolean mask with the same shape as have: mask[i, j] is True if j < use[i, 0], otherwise it is False. The mask is therefore True at exactly the positions that should be replaced by the first-column value. Then use np.where to do the replacement.

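n, m = have.shape
# column indices 0..m-1 repeated for each of the n rows, compared with use (shape (n, 1))
mask = np.repeat(np.arange(m)[None, :], n, axis=0) < use
# where mask is True take the row's first element, otherwise keep the original value
have = np.where(mask, have[:, 0:1], have)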

Output:

>>> have
array([[1, 1, 3, 4],
       [5, 5, 5, 8]])
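As a side note, the np.repeat step can probably be dropped, because comparing an array of shape (m,) with one of shape (n, 1) already broadcasts to (n, m). A minimal sketch of that variant, which stores the result in a new name (solution) instead of overwriting have:

```python
import numpy as np

have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])

# np.arange(m) has shape (m,), use has shape (n, 1); the comparison broadcasts to (n, m).
mask = np.arange(have.shape[1]) < use

# Take the first column where the mask is True, the original value elsewhere.
solution = np.where(mask, have[:, :1], have)
print(solution)
```

This avoids materialising the repeated index array, which may matter for very large inputs, although I have not timed it against the version above.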
