
I am writing a module to train an ML model on a large dataset. It contains 0.6M data points, each with 0.15M dimensions, and I am having trouble just loading the dataset itself (it is all NumPy arrays).

Below is a code snippet that replicates the major behaviour of the actual code:

import numpy
import psutil

FV_length = 150000
X_List = []
Y_List = []

for i in range(0, 600000):
    feature_vector = numpy.zeros(FV_length, dtype=numpy.int64)
    # using db data, mark the features to be activated
    class_label = 0
    X_List.append(feature_vector)
    Y_List.append(class_label)

    if i % 100 == 0:
        print(i)
        print("Virtual mem %s" % psutil.virtual_memory().percent)
        print("CPU usage %s" % psutil.cpu_percent())

X_Data = numpy.asarray(X_List)
Y_Data = numpy.asarray(Y_List)

The code results in ever-increasing memory allocation until the process gets killed. Is there a way to reduce this memory growth?

I have tried using gc.collect(), but it always returns 0. I have also explicitly set variables to None, again to no avail.

Anuj Gupta
    What behaviour do you expect? You create a new vector of length FV_length every time you go around the loop and store it in a list. This will lead to increasing memory allocation. What is the total memory allocation you expect at the end of the loop? – Conor Oct 12 '15 at 12:56
  • @Conor: I am using PyBrain to train a neural net. My feature vectors have 0.15M dimensions. I understand that I am creating a new vector and appending it to a list on each iteration, hence the increasing memory allocation. I run this code on an 8GB AWS machine. I want to understand whether there is a better way to write this code. – Anuj Gupta Oct 12 '15 at 13:02
  • You're trying to store 90 billion ints in memory at the same time. Obviously that won't fit in 8GB of memory. I don't know what kind of answer you expect here, as we don't know anything about your requirements. – interjay Oct 12 '15 at 13:11
  • With an `FV_length` of 150000 and 600000 iterations, your final list will contain 90000000000 elements. Assuming you're using 64-bit Python, each `int64` element will be 8 bytes, so you would need 720000000000 bytes, or 720GB, just to store the elements in the final list. – ali_m Oct 12 '15 at 13:11
  • @Conor, interjay, ali_m: I agree with your reasoning, but interestingly the same code runs fine on my Mac (4GB), slowly but without ever breaking. That's the weird part. – Anuj Gupta Oct 12 '15 at 14:47
  • The reason why it succeeds on your mac is because OSX does [lazy memory allocation](http://stackoverflow.com/a/27582592/1461210). When you initialize a new array using `np.zeros` the OS does not immediately allocate any physical memory for it, but instead waits until those memory addresses are written to. Since you're only initialising arrays and never writing to them, the calls to `np.zeros` can still succeed even after you've created arrays that exceed the total physical memory available. This is known as overcommit. Obviously it doesn't help if you want to actually write to the arrays. – ali_m Oct 13 '15 at 00:06
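
A minimal sketch of the overcommit behaviour described in the last comment, using psutil (already imported in the question) to watch the process's resident memory; the exact numbers, and whether the allocation is lazy at all, depend on the OS and its overcommit settings:

import numpy as np
import psutil


def rss_mb():
    # Resident set size (physical memory actually used) of this process, in MB.
    return psutil.Process().memory_info().rss / 1e6


print("baseline RSS:   %.1f MB" % rss_mb())

# np.zeros requests zero-filled pages from the OS; on Linux and OS X the
# physical pages are typically not committed until they are first written to.
a = np.zeros(125000000, dtype=np.int64)       # ~1 GB of virtual address space
print("after np.zeros: %.1f MB" % rss_mb())   # RSS usually barely moves

a[:] = 1                                      # writing forces real allocation
print("after writing:  %.1f MB" % rss_mb())   # RSS jumps by roughly 1 GB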

2 Answers


As noted in the comments, the data volume here is simply very large, and the neural network would probably struggle even if you managed to load the training set. Your best option is probably to look into some method of dimensionality reduction for your data points. Something like Principal Component Analysis could help bring the 150K dimensions down to a more reasonable number.
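
A minimal sketch of that suggestion (not part of the original answer), assuming scikit-learn is available: IncrementalPCA learns the projection batch by batch, so the full 600000 x 150000 matrix never has to exist in memory. The helper load_batch() is hypothetical and stands in for the database code that marks the activated features; n_components and batch_size are illustrative, not tuned values.

import numpy as np
from sklearn.decomposition import IncrementalPCA

N_SAMPLES = 600000
FV_LENGTH = 150000
BATCH_SIZE = 1000      # must be >= N_COMPONENTS for partial_fit
N_COMPONENTS = 200     # illustrative target dimensionality


def load_batch(start, stop):
    # Hypothetical helper: build one dense batch of feature vectors from the
    # db, marking the activated features as in the question's loop.
    batch = np.zeros((stop - start, FV_LENGTH), dtype=np.float32)
    # ... mark activated features here ...
    return batch


ipca = IncrementalPCA(n_components=N_COMPONENTS)

# First pass over the data: learn the projection one batch at a time.
for start in range(0, N_SAMPLES, BATCH_SIZE):
    ipca.partial_fit(load_batch(start, start + BATCH_SIZE))

# Second pass: project each batch down to N_COMPONENTS dimensions.
X_reduced = np.vstack([ipca.transform(load_batch(start, start + BATCH_SIZE))
                       for start in range(0, N_SAMPLES, BATCH_SIZE)])
# X_reduced is 600000 x 200 -- on the order of 0.5 GB instead of ~720 GB.

Each dense batch here is roughly 0.6 GB as float32, so the working set stays within an 8GB machine.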

Conor

This is what I did for a similar problem: I simply create a fresh empty list whenever its contents should be overwritten.

# initialize
X_List = []
Y_List = []

# ... do something with the lists ...

Now, if you don't need the old values any more, just create the lists again:

X_List = [] 
Y_List = []

But I don't know if this is needed or possible in your case. Maybe it's not the most idiomatic way, but it worked for me.
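
A minimal sketch of that pattern (not from the original answer), using psutil as in the question to watch resident memory. Rebinding the name only frees the old arrays if nothing else still references them, and whether the process returns the memory to the OS right away depends on the platform's allocator:

import numpy as np
import psutil


def rss_mb():
    # Resident set size (physical memory actually used) of this process, in MB.
    return psutil.Process().memory_info().rss / 1e6


# np.ones writes every element, so the memory really is backed (about 1.2 GB).
X_List = [np.ones(150000, dtype=np.int64) for _ in range(1000)]
print("with data:    %.1f MB" % rss_mb())

# Rebinding the name drops the only reference to the old list, so the list and
# the arrays it holds are freed by reference counting straight away.
X_List = []
print("after rebind: %.1f MB" % rss_mb())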

Deepend