I need to vertically stack NumPy arrays that are returned by function_returns_some_np_array. The function always returns an array of the same shape; in this case its length is 10. If I do not check whether X is empty, I get the following error:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 10

The code including the check:

    X = np.empty(0)
    if X.size == 0:
        X = function_returns_some_np_array(data)
    else:
        X = np.vstack((X, function_returns_some_np_array(data)))

Is it always necessary to check whether the array is empty, or is there a one-line solution to handle this? Something built-in would be great. In short: is there a shorter way to do this operation?

Thanks

Jürgen K.

1 Answer

Your example lacks some context. Does the iterative stacking of X happen in some sort of loop?

It's usually best to avoid iterative stacking (or appending, etc.) like that, since it forces NumPy to allocate a new array and copy all existing data each time.

If your use case looks something like:

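import numpy as np

func = lambda x: np.random.randn(x)  # stand-in for the real function
data = 10000

X = None

for _ in range(50):
    # First iteration: there is nothing to stack onto yet
    if X is None:
        X = func(data)
    else:
        # Every later iteration copies the whole stack into a new array
        X = np.vstack((X, func(data)))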

I would simply collect the intermediate results in a list and stack only once at the end. That assumes you don't need the intermediate stack in your calculations. If you do, preallocating the "final" array up front and writing each result from the function into it could help, but that requires knowing the final size (the number of iterations/calls) in advance.
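A minimal sketch of the preallocation variant, assuming the number of calls is known up front. Here `n_calls` and the body of `function_returns_some_np_array` are hypothetical stand-ins, not the asker's real code:

```python
import numpy as np

def function_returns_some_np_array(data):
    # Hypothetical stand-in: returns a 1-D array of length `data`
    return np.random.randn(data)

data = 10      # length of each returned array (10 in the question)
n_calls = 50   # assumed to be known before the loop starts

# Allocate the final array once; its contents are arbitrary until filled
X = np.empty((n_calls, data))

for i in range(n_calls):
    # Write each result into its own row instead of re-stacking
    X[i] = function_returns_some_np_array(data)

print(X.shape)
```

The rows filled so far are available as `X[:i + 1]` inside the loop, so intermediate results can still be used.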

For example, collecting the results in a list and stacking once:

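res = []
for _ in range(50):
    # Appending to a list is cheap; no array is copied here
    res.append(function_returns_some_np_array(data))

# Build the final 2-D array in a single call
X = np.vstack(res)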

That removes the if-statement, which makes the code a little easier to read. You only have an initialization "issue" at the beginning, not a real decision that needs an if-statement.
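If the intermediate stack really is needed on every iteration, the initialization issue can also be sidestepped by starting from an array with zero rows and the right number of columns. The error in the question arises because `np.vstack` treats `np.empty(0)` as shape `(1, 0)`, which does not match `(1, 10)`. A rough sketch, where the body of `function_returns_some_np_array` is a hypothetical stand-in and the row length of 10 is taken from the question:

```python
import numpy as np

def function_returns_some_np_array(data):
    # Hypothetical stand-in: returns a 1-D float array of length 10
    return np.asarray(data, dtype=float)

data = range(10)

# Zero rows, ten columns: compatible with vstack from the first iteration
X = np.empty((0, 10))

for _ in range(3):
    # No emptiness check needed, but each call still copies the whole stack
    X = np.vstack((X, function_returns_some_np_array(data)))

print(X.shape)  # (3, 10)
```

This only removes the check; it still creates a new array on every iteration, so the list approach remains the cheaper option when intermediate results are not required.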

It's also about 3x faster than the loop-and-stack version at the top, but that really depends on the size of the returned array versus the number of iterations.
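A comparison like this can be reproduced with the `timeit` module. The following is only a sketch of one possible benchmark, not the exact code behind the figure above; the helper names `make_row`, `stack_in_loop` and `stack_once` are made up for illustration, and the measured ratio will vary with array size, iteration count and machine:

```python
import timeit
import numpy as np

def make_row(n=10000):
    # Hypothetical stand-in for the real function
    return np.random.randn(n)

def stack_in_loop(iterations=50):
    # Re-stack on every iteration, as in the first snippet
    X = None
    for _ in range(iterations):
        row = make_row()
        X = row if X is None else np.vstack((X, row))
    return X

def stack_once(iterations=50):
    # Collect in a list, then stack a single time
    return np.vstack([make_row() for _ in range(iterations)])

# Time a few repetitions of each approach
t_loop = timeit.timeit(stack_in_loop, number=5)
t_once = timeit.timeit(stack_once, number=5)
print(f"loop: {t_loop:.4f}s, once: {t_once:.4f}s")
```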

Rutger Kassies
  • Thanks for your answer. The data is inside a for loop. Where did you get the information about how much faster each one is? It's quite an interesting insight – Jürgen K. Sep 07 '21 at 11:54
  • That was based on a simple benchmark I did using the above code and the `timeit` module. But that part of your question is a very fundamental problem in programming, and already much discussed for Python/NumPy specifically. See for example: https://stackoverflow.com/a/46103554/1755432 – Rutger Kassies Sep 07 '21 at 12:10