I have code that generates two numpy arrays (data_transform) inside a for loop. The first iteration produces an array of shape (40, 2) and the second one of shape (175, 2). I want to concatenate these two arrays into one of shape (215, 2). I tried np.concatenate and np.append, but I get an error saying the arrays must be the same size. Here is an example of what I am doing:

result_arr = np.array([])

for label in labels_set:
    data = [index for index, value in enumerate(labels_list) if value == label]
    for i in data:
        sub_corpus.append(corpus[i])
    data_sub_tfidf = vec.fit_transform(sub_corpus) 
    data_transform = pca.fit_transform(data_sub_tfidf) 
    #Append array
    sub_corpus = []

I have also tried np.row_stack, but it only gives me an array of shape (175, 2), which is the second array I want to concatenate.

Luis Miguel
  • You assign `result_arr` at the start. Why? Then in the loop you assign it again - but don't use it as an argument for `row_stack`. Are you trying to imitate a list `append` loop? – hpaulj Sep 24 '19 at 15:58
  • @hpaulj I try to create an empty array to fill it. I just want to do the operation that I do when I append values to a `list`. – Luis Miguel Sep 24 '19 at 16:05
  • Stick with the list append; don't try to imitate it with arrays. Make the array in one step, at the end. – hpaulj Sep 24 '19 at 16:06
  • `alist.append(x)` operates in-place on `alist`. `np.row_stack(data_transform)` returns a new array. It does not use or operate on `result_arr`, which does not appear at all in that expression. The `result_arr=...` step just replaces the previous value with a new one. The syntax is totally different from the list code. – hpaulj Sep 24 '19 at 16:24
  • @hpaulj I know but I want to perform the operation that `alist.append()` does, but with a `numpy`. – Luis Miguel Sep 24 '19 at 16:50
  • List append adds a pointer/reference to itself. There isn't anything equivalent for arrays. To join two arrays you have to make a new one. That's more expensive than the simpler list append. And you have to pay close attention to the shape of the respective arrays - `concatenate` (and the stack variants) is quite picky about that. – hpaulj Sep 24 '19 at 16:55

3 Answers

What @hpaulj was trying to say with

Stick with list append when doing loops.

is

# Use a normal Python list to collect the arrays
result_list = []

for label in labels_set:
    # ... build data_sub_tfidf as in the question ...
    data_transform = pca.fit_transform(data_sub_tfidf)

    # Append the data_transform array to that list.
    # Note: this is list.append(), not np.append(), which is slow here.
    result_list.append(data_transform)

# Join the arrays once, after the loop.
# This avoids reallocating and copying a growing array on every iteration:
# only one large chunk of memory is allocated, since
# the final size of the concatenated array is known.
result_arr = np.concatenate(result_list)  # axis=0 is the default

# or, equivalently for 2-D arrays
result_arr = np.vstack(result_list)

# Note: np.stack(result_list, axis=0) does not work here. It requires all
# arrays to have the same shape and joins them along a new axis.

Your arrays don't really have different dimensions. They differ only in the size of the first axis; the second axis is identical. In that case you can always concatenate along the axis that differs.
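To make the pattern concrete, here is a minimal, self-contained sketch. Random arrays stand in for the output of pca.fit_transform (an assumption for illustration; only the shapes (40, 2) and (175, 2) come from the question). The np.stack call is included only to show that it needs identical shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two data_transform arrays produced in the loop
# (hypothetical random data; only the shapes match the question).
shapes = [(40, 2), (175, 2)]

collected = []
for shape in shapes:
    data_transform = rng.random(shape)
    collected.append(data_transform)  # list.append works in place

# Join once after the loop, along the axis whose size differs
result_arr = np.concatenate(collected, axis=0)
print(result_arr.shape)  # (215, 2)

# np.vstack gives the same result for 2-D inputs
same_as_vstack = np.array_equal(result_arr, np.vstack(collected))

# np.stack requires identical shapes, so it raises ValueError here
try:
    np.stack(collected, axis=0)
    stack_ok = True
except ValueError:
    stack_ok = False
print(stack_ok)  # False
```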

Joe
  • Is there actually a difference between `np.concatenate` and `np.stack` here? concatenate seems to arrange them column-wise into the matrix, while stack seems to stack them row-wise on top of each other. – MJimitater Jan 06 '21 at 08:54
  • What if you want to vertically concatenate multiple 2D arrays in a for loop without using np.vstack()? – VMMF Aug 17 '22 at 12:12
  • How do you make sure it is not np append? – Ana Aug 09 '23 at 15:16

Using np.append, starting from an empty array "c":

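import numpy as np

a = np.array([[8, 3, 1], [2, 5, 1], [6, 5, 2]])
b = np.array([[2, 5, 1], [2, 5, 2]])
matrix = [a, b]

# Empty array with 0 rows and the same number of columns as the inputs
c = np.empty([0, matrix[0].shape[1]])

# Append each array along the rows; every call returns a new array
for v in matrix:
    c = np.append(c, v, axis=0)

print(c)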

Output:

[[8. 3. 1.]
 [2. 5. 1.]
 [6. 5. 2.]
 [2. 5. 1.]
 [2. 5. 2.]]
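The trailing dots in the output appear because np.empty creates a float64 array by default, so the integer rows are upcast when appended. A small sketch, assuming you want to keep the integer dtype, passes dtype explicitly (the names c_float, c_int and c_once are introduced here for illustration only); a single np.concatenate on the list gives the same values without the empty seed array:

```python
import numpy as np

a = np.array([[8, 3, 1], [2, 5, 1], [6, 5, 2]])
b = np.array([[2, 5, 1], [2, 5, 2]])
matrix = [a, b]

# Default dtype of np.empty is float64, so the result becomes float
c_float = np.empty([0, a.shape[1]])
for v in matrix:
    c_float = np.append(c_float, v, axis=0)

# Matching the input dtype keeps the integers
c_int = np.empty([0, a.shape[1]], dtype=a.dtype)
for v in matrix:
    c_int = np.append(c_int, v, axis=0)

# One concatenate call on the list gives the same values
c_once = np.concatenate(matrix, axis=0)

print(c_float.dtype)                  # float64
print(c_int.dtype == a.dtype)         # True
print(np.array_equal(c_int, c_once))  # True
```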
Manuel
  • Repeated `concatenate` is slow. Stick with list append when doing loops. – hpaulj Sep 24 '19 at 16:07
  • I turned your comment into an answer, hope you don't mind :) – Joe Sep 24 '19 at 16:59
  • @Manuel, `np.append()` and `np.concatenate()` are both slow when used in a for loop, they basically do the same thing behind the scenes. – Joe Sep 25 '19 at 10:40

If you have an array a of shape (40, 2) and an array b of shape (175, 2), you can get a final array of shape (215, 2) with np.concatenate([a, b]).

FrankyBravo
  • I know. The problem is that I don't have two arrays, only one. The variable `data_transform` change in every loop. – Luis Miguel Sep 24 '19 at 16:20