
I have a list of numpy arrays. The list has 200001 elements and each array is of size 3504. This fits fine in my RAM:

(Pdb) type(x)
<type 'list'>
(Pdb) len(x)
200001
(Pdb) type(x[1])
<type 'numpy.ndarray'>
(Pdb) x[1].shape
(3504L,)

The problem is that when I convert the list to a numpy array, RAM usage hits 100% and the conversion freezes/crashes my PC. My intent in converting is to perform some feature scaling and PCA.

EDIT: I want to convert each sample into a concatenation of the previous 1000 samples plus the sample itself.

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    temp_labels = y[1000:]
    final_train_set = []
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        # concatenate the previous 1000 samples plus the current one
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        # keep the detail coefficients of a single-level haar DWT
        __, cd_i = pywt.dwt(data, 'haar')
        final_train_set.append(cd_i)
    return final_train_set, y


x, y = take_previous_data(X_train, y)
Abhishek Bhatia
  • Why don't you read your data as a `numpy.array` in the first place? – Eli Korvigo Aug 25 '15 at 18:28
  • I am appending `numpy.arrays` to a list, which is more efficient than appending to a `numpy array`. – Abhishek Bhatia Aug 25 '15 at 18:29
  • Perhaps you could consider single precision or a smaller integer type – Jens Munk Aug 25 '15 at 18:32
  • Python lists are much less efficient than numpy arrays. By converting `x` to a numpy array you are duplicating the memory, which is probably why it crashes. There are many ways (much more efficient than using a list) to initialize your data as numpy arrays. Where are you reading your *appended numpy arrays* from? I mean, the problem is not that numpy crashes, the problem is that your *reading data logic* is what needs to be improved. – Imanol Luengo Aug 25 '15 at 18:38
  • @imaluengo Thanks for the comment! Please check the edit. – Abhishek Bhatia Aug 25 '15 at 18:45
  • What is the dtype of the arrays? How much RAM do you have? – ali_m Aug 25 '15 at 18:58
  • @ali_m I have `12 GB RAM` and an `int` type array. – Abhishek Bhatia Aug 25 '15 at 19:00
  • Indeed, appending to a `list` takes `O(1)` amortised, but you don't have to append in the first place. You can make a lazy generator and give it to `numpy.fromiter` while specifying the data type and shape. This way you'll get your array without any intermediate data structures. – Eli Korvigo Aug 25 '15 at 19:03
  • @EliKorvigo Could you elucidate more, please? I am pretty new to numpy and Python. – Abhishek Bhatia Aug 25 '15 at 19:05

1 Answer

You could try rewriting `take_previous_data` as a generator function that lazily yields rows of your final array, then use `np.fromiter`, as Eli suggested:

from itertools import chain

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        __, cd_i = pywt.dwt(data, 'haar')
        yield cd_i

gen = take_previous_data(X_train, y)

# I'm assuming that by "int" you meant "int64"
x = np.fromiter(chain.from_iterable(gen), np.int64)

# fromiter gives a 1D output, so we reshape it into a 2D array with one row
# per sample; the generator yields len(X_train) - 1000 rows
x.shape = len(X_train) - 1000, -1
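As an aside, `np.fromiter` also accepts a `count` argument. If you happen to know the length of each `cd_i` up front (call it `row_len`; a hypothetical name here), passing a count lets fromiter allocate the whole output in one step rather than growing it as it reads:

# row_len is hypothetical: the (assumed known) length of each cd_i
n_rows = len(X_train) - 1000
x = np.fromiter(chain.from_iterable(gen), np.int64, count=n_rows * row_len)
x.shape = n_rows, -1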

Another option would be to pre-allocate the output array and fill in the rows as you go along:

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    out = None
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        __, cd_i = pywt.dwt(data, 'haar')
        if out is None:
            # allocate once the length of a coefficient row is known
            out = np.empty((len(temp_train_data), len(cd_i)), np.int64)
        out[index] = cd_i
    return out

From our chat conversation, it seems that the fundamental issue is that you can't actually fit the output array itself in memory. In that case, you could adapt the second solution to use np.memmap to write the output array to disk:

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    out = None
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        __, cd_i = pywt.dwt(data, 'haar')
        if out is None:
            # create the memmap once the length of a coefficient row is known
            out = np.memmap('my_array.mmap', dtype=np.int64, mode='w+',
                            shape=(len(temp_train_data), len(cd_i)))
        out[index] = cd_i
    return out
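`np.memmap` arrays also have a `flush()` method that you can call once the loop is done, if you want to be sure the cached changes have actually been written out before you carry on (a minimal sketch, reusing the function above):

out = take_previous_data(X_train, y)
out.flush()  # push any changes still cached in RAM out to my_array.mmap
del out      # drop the Python reference; the data stays on disk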

One other obvious solution would be to reduce the bit depth of your array. I've assumed that by int you meant int64 (the default integer type in numpy). If you could switch to a lower bit depth (e.g. int32, int16 or maybe even int8), you could drastically reduce your memory requirements.
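As a rough back-of-the-envelope illustration (using the (200001, 3504) shape from your question):

import numpy as np

# footprint of a (200001, 3504) array at various bit depths
n_elements = 200001 * 3504
for name in ('int64', 'int32', 'int16', 'int8'):
    size_gb = n_elements * np.dtype(name).itemsize / 1e9
    print('%s: %.1f GB' % (name, size_gb))  # int64 comes to ~5.6 GB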

ali_m
  • Thanks! You didn't reshape though, please check. – Abhishek Bhatia Aug 25 '15 at 19:24
  • Yes I did. You can reshape an array in place by assigning to its `.shape` attribute. The -1 means to infer the size of the array in that dimension, based on the total number of elements. – ali_m Aug 25 '15 at 19:25
  • I believe `cd_i` is a sequence, hence you need to call `np.fromiter(itertools.chain(*gen), dtype=np.int64)` for `np.fromiter` to work, because it only accepts 1D data streams. I haven't slept for quite a while, so I may be wrong. – Eli Korvigo Aug 25 '15 at 19:33
  • @EliKorvigo Good spot. – ali_m Aug 25 '15 at 19:37
  • @ali_m Thanks for the continued support! I didn't know about reshaping this way. I am getting an error though: `NameError: "name 'chain' is not defined"` – Abhishek Bhatia Aug 25 '15 at 19:44
  • Did you see the line `from itertools import chain`? – ali_m Aug 25 '15 at 19:45
  • @ali_m Thanks, missed that. It is still crashing my memory with `200000`. – Abhishek Bhatia Aug 25 '15 at 21:57
  • Have you tried both options? Perhaps you simply don't have RAM to accommodate the output array. A `(200001, 3504)` array of int64 should fill 5.6GB, but I don't know what else you're holding in memory. You could use [`np.memmap`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) to write the output array to disk if it's too big. – ali_m Aug 25 '15 at 22:05
  • @ali_m Just to clarify, is there a difference between the two approaches in terms of efficiency? – Abhishek Bhatia Aug 25 '15 at 22:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/87934/discussion-between-ali-m-and-abhishek-bhatia). – ali_m Aug 25 '15 at 22:11
  • @ali_m Hi, thanks for the amazing answer! I am trying your `memmap` approach, but it seems to crash my PC. It writes a 37GB `memmap`. Can you please explain how your code flushes it out of memory to disk? I am a bit confused; the doc mentions `Memory-mapped arrays use the Python memory-map object which (prior to Python 2.5) does not allow files to be larger than a certain size depending on the platform. This size is always < 2GB even on 64-bit systems.` – Abhishek Bhatia Aug 26 '15 at 07:47
  • @ali_m My question is: while the program is running, it still stores the array in memory and flushes to disk only after the program closes, which defeats its purpose. Does this make sense? – Abhishek Bhatia Aug 26 '15 at 08:21
  • [See here](http://stackoverflow.com/a/20713130/1461210). The memory usage of an `np.memmap` array is handled by the OS. Normally, any changes to the array are initially cached in RAM, then written to disk when the OS decides this is necessary. You should expect to see your memory usage increase up to a point, but the OS should not allow the write buffer to exceed the total amount of physical memory available. If you're still running out of RAM completely then it's probably nothing to do with `np.memmap`. – ali_m Aug 27 '15 at 00:33
  • Thanks @ali_m for the answer! How can I force it to write to the disk and remove it from memory? Using `Windows 10`. I want to do this after the for loop is complete, to avoid later memory overflows while processing. – Abhishek Bhatia Aug 27 '15 at 15:22
  • `for index,row in enumerate(temp_train_data): .... out[index] = cd_i;out.flush()`. Does this make sense? – Abhishek Bhatia Aug 27 '15 at 16:00
  • I want to free the memory completely. Using `out.flush` and `del out` are different things; I don't understand why, though. – Abhishek Bhatia Aug 27 '15 at 17:08
  • Please ask this as a separate question. I'm not going to answer any more questions posed here, since the comments are getting out of hand. – ali_m Aug 27 '15 at 17:12
  • Makes sense. Please check http://stackoverflow.com/questions/32255818/flushing-memmap-completely-to-disk. – Abhishek Bhatia Aug 27 '15 at 17:27