3

I have two big numpy 2D arrays. One shape is X1 (1877055, 1299), another is X2 (1877055, 1445). I then use

X = np.hstack((X1, X2))

to concatenate the two arrays into one bigger array. However, the program exits with code -9 without printing any error message.

What is the problem? How can I concatenate such two big numpy 2D arrays?

Excalibur
  • 431
  • 6
  • 19
  • Sample data? Floats or ints? Does this work on smaller arrays? At single precision, `X` is around 20 GB, so depending on your RAM, this could be super slow in the best case scenario. – JohnE May 26 '15 at 01:30
  • There are three possible problems here: (1) You're using a 32-bit Python, and the 32-bit virtual memory space (2-4GB) isn't enough to fit `X1`, `X2`, and `X` all into the 32-bit virtual memory space. (2) You're using a 64-bit Python, but you don't have enough actual memory to fit them all. (3) Your NumPy installation is broken. So, first, are you on a 32-bit Python? What `dtype`? How much memory do you have? And how did you install NumPy and what version do you have? And what platform? – abarnert May 26 '15 at 01:31
  • More simply, print out `X1.nbytes` and `X2.nbytes` and then we don't have to guess how much memory you're using. – abarnert May 26 '15 at 01:32
  • Anyway, this is almost certainly a memory error, but you haven't given us enough information to diagnose it. My first guess would be that you're on a linux system that has maybe 64-68GB of physical RAM+swap, and your dtype is float64 so you're trying to allocate another 38GB after you're already using 38GB, and the kernel lets you overcommit but then you segfault because you actually try to use all 76GB. – abarnert May 26 '15 at 01:40
  • Thanks @abarnert, the problem is the memory! – Excalibur May 26 '15 at 16:17

2 Answers

8

Unless there's something wrong with your NumPy build or your OS (both of which are unlikely), this is almost certainly a memory error.

For example, let's say all these values are float64. You've already allocated roughly 18GB and 20GB for these two arrays, and now you're trying to allocate another 38GB for the concatenated array. But you only have, say, 64GB of RAM plus 2GB of swap, so there isn't room for another 38GB. On some platforms, this allocation will simply fail, which NumPy should catch and raise as a MemoryError. On other platforms, the allocation may succeed, but as soon as you actually touch all of that memory the process gets killed (see overcommit handling in Linux for an example; an exit code of -9 means the process was terminated by SIGKILL, which is exactly what the Linux OOM killer sends). On still other platforms, the system will try to auto-expand swap, and then kill the process if it runs out of disk space.
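As a quick back-of-the-envelope check, assuming float64 and the shapes from the question (the variable names here are just for illustration):

```python
import numpy as np

rows = 1_877_055
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

x1_bytes = rows * 1299 * itemsize           # ~18.2 GiB
x2_bytes = rows * 1445 * itemsize           # ~20.2 GiB
x_bytes = rows * (1299 + 1445) * itemsize   # ~38.4 GiB

# hstack needs all three arrays alive at the same time:
total_gib = (x1_bytes + x2_bytes + x_bytes) / 2**30
print(f"peak requirement: {total_gib:.1f} GiB")  # peak requirement: 76.8 GiB
```

If that peak exceeds your RAM plus swap, a kill by the OOM killer is the expected outcome on Linux.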

Whatever the reason, if you can't fit X1, X2, and X into memory at the same time, what can you do instead?

  • Allocate X up front, and create X1 and X2 as sliced views into X, filling them in place so the data is only ever stored once.
  • Write X1 and X2 out to disk, concatenate on disk, and read them back in.
  • Send X1 and X2 to a subprocess that reads them iteratively, builds X, and then continues the work.
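A minimal sketch of the first two options, using small stand-in shapes (the sizes, fill values, and temp-file name below are illustrative, not from the question):

```python
import os
import tempfile

import numpy as np

# Small stand-ins for the real (1877055, 1299) and (1877055, 1445) shapes.
n_rows, n1, n2 = 1000, 13, 14

# Option 1: allocate the combined array up front and fill the two halves
# through sliced views, so X1 and X2 never exist as separate copies.
X = np.empty((n_rows, n1 + n2), dtype=np.float64)
X1 = X[:, :n1]   # a view into X -- no extra memory
X2 = X[:, n1:]   # likewise
X1[:] = 1.0      # stand-in for whatever computation produced X1
X2[:] = 2.0

# Option 2: concatenate into a memory-mapped .npy file on disk, so the
# combined array never needs to fit in RAM alongside its inputs.
a = np.full((n_rows, n1), 1.0)
b = np.full((n_rows, n2), 2.0)
path = os.path.join(tempfile.gettempdir(), "X_demo.npy")
out = np.lib.format.open_memmap(path, mode="w+", dtype=np.float64,
                                shape=(n_rows, n1 + n2))
out[:, :n1] = a
out[:, n1:] = b
out.flush()  # reload later with np.load(path, mmap_mode="r")
```

With option 2, downstream code can read slices of the result without ever holding the whole thing in memory.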
abarnert
  • 354,177
  • 51
  • 601
  • 671
-3

Not an expert in NumPy, but why not use `numpy.concatenate()`?

http://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html

For example:

>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[5, 6]])
>>> np.concatenate((a, b), axis=0)
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> np.concatenate((a, b.T), axis=1)
array([[1, 2, 5],
       [3, 4, 6]])
ederollora
  • 1,165
  • 2
  • 11
  • 29
  • 3
    `concatenate` and `hstack` do the exact same thing. In fact, the [`hstack`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html) docs even say they're equivalent. So this isn't going to help. – abarnert May 26 '15 at 01:27