
I have a problem splitting a NumPy array and a list into two parts. Here is my code:

X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)

y = np.vstack(y)

train_x = np.array(X)[:int(len(X) * 0.9)]
train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]
validation_y = y[int(len(X) * 0.9):]

This is a sample of the code that prepares data for a neural network. It works, but it raises a MemoryError (I have 32 GB of RAM on board):

Traceback (most recent call last):
  File "D:/Projects/....Here is a file location.../FileName.py", line 120, in <module>
    validation_x = np.array(X)[int(len(X) * 0.9):]
MemoryError

It seems like it keeps the list X and the array y in memory and duplicates them as the separate variables train_x, train_y, validation_x, validation_y. Do you know how to deal with this?

Shape of X: (324000, 256, 24)
Shape of y: (324000, 10)
Shape of train_x: (291600, 256, 24)
Shape of train_y: (291600, 10)
Shape of validation_x: (32400, 256, 24)
Shape of validation_y: (32400, 10)
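
For scale, a rough size estimate (the dtype isn't shown here, so float64 and float32 sizes are assumptions) shows why holding more than one full copy of X doesn't fit in 32 GB:

# Approximate footprint of the stacked X array (dtype assumed)
n_values = 324000 * 256 * 24        # ~1.99e9 elements
print(n_values * 8 / 1e9, "GB")     # ~15.9 GB if float64
print(n_values * 4 / 1e9, "GB")     # ~8.0 GB if float32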

ketzul
  • Are you trying to use files to store the training data? You could pickle each row to a file and avoid using RAM. What do you think about that? https://stackoverflow.com/a/55474324/4510954 Moreover, you can use sklearn.model_selection.train_test_split to perform your split. – ElConrado Apr 05 '19 at 05:56

2 Answers

X = []
y = []
for seq, target in ConvertedData:
    X.append(seq)
    y.append(target)

X is a list of seq items; I assume those are arrays. X just holds pointers to them.

y = np.vstack(y)

train_x = np.array(X)[:int(len(X) * 0.9)]

Makes an array from X, and then takes a slice of that array. The full np.array(X) still exists in memory.

train_y = y[:int(len(X) * 0.9)]
validation_x = np.array(X)[int(len(X) * 0.9):]

Makes another full array from X. train_x and validation_x end up as views of two separate full-size arrays (see the quick check after this walkthrough).

validation_y = y[int(len(X) * 0.9):]
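
A quick check (a small standalone sketch, not the asker's data): a basic slice is a view whose base is the full array it was sliced from, so that full array cannot be freed while the view exists.

import numpy as np

a = np.zeros((4, 3))             # stands in for the temporary np.array(X)
s = a[:3]                        # basic slicing returns a view, no data copied
print(s.base is a)               # True: the view keeps the whole array alive
print(np.shares_memory(s, a))    # True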

Doing

X1 = np.array(X)
train_x = X1[:...]
validation_x = X1[...:]

will eliminate that duplication. Both are views of the same X1.

Another approach would be to slice the list first:

train_x = np.array(X[:...])
validation_x = np.array(X[...:])

My guess is that memory use, at least within the arrays, will be similar.

del X after creating X1 might also help, allowing X and the arrays it references to be garbage collected.
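
Put together, a sketch of that approach (using the 90% split point from the question; gc.collect() is optional and just makes the release explicit):

import gc
import numpy as np

X1 = np.array(X)              # build the full array once
split = int(len(X1) * 0.9)

train_x = X1[:split]          # views of X1, no extra copies
validation_x = X1[split:]
train_y = y[:split]           # y is already stacked, so slice it the same way
validation_y = y[split:]

del X                         # drop the list so it (and the arrays it
gc.collect()                  # references) can be garbage collected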

But beware that once you start hitting a memory error at one point in your code, tricks like this might postpone it. Calculations can easily end up making copies, or temporary buffers, of comparable size.


Your split uses 2 slices; that results in views, which don't add to the original memory use. But if you make a shuffled split, the train and validation parts will be copies, and together take up as much memory as the source.
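
That difference is easy to check (small sketch): basic slices share memory with the source array, while indexing with a shuffled index array makes copies.

import numpy as np

a = np.zeros((10, 3))
idx = np.random.permutation(len(a))

sliced = a[:9]                        # basic slice -> view
shuffled = a[idx[:9]]                 # fancy indexing -> copy

print(np.shares_memory(sliced, a))    # True
print(np.shares_memory(shuffled, a))  # False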

hpaulj
  • Hello, thank you. I used this solution with X1 = np.array(X). It still peaks near maximum memory, but only for less than a minute, so there is no memory error and the program keeps running. By the way, after creating the train and validation arrays I delete the arrays X and y and call gc.collect(); after that, memory usage drops to 6 GB before the machine learning starts. Thanks! – ketzul Apr 06 '19 at 11:45

As described in this answer about memory errors, you can pickle each array of training data to a file, as in this question.
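
For example (a minimal sketch with hypothetical file names; np.save/np.load with mmap_mode is one way to keep the saved arrays on disk and read slices lazily, instead of plain pickling):

import numpy as np

np.save('train_x.npy', train_x)    # hypothetical file names
np.save('train_y.npy', train_y)

# Later: memory-map the files so data is read from disk on demand
train_x = np.load('train_x.npy', mmap_mode='r')
train_y = np.load('train_y.npy', mmap_mode='r')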

You can also split with train_test_split; it could be a more efficient way of performing the split.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 5 samples with 2 features each, and 5 labels
X, y = np.arange(10).reshape((5, 2)), range(5)
# Shuffled 67/33 split; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
ElConrado
  • I think the `sklearn` split takes a random (or shuffled) split, which will result in copies of the source, not views. – hpaulj Apr 05 '19 at 06:23