I am writing a module to train an ML model on a large dataset. It includes 0.6M datapoints, each with 0.15M dimensions, all stored as numpy arrays. I am running into problems just loading the dataset.
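For scale, my rough back-of-the-envelope estimate of the dense size (assuming 8 bytes per value, i.e. int64) comes out to around 720 GB, which I suspect is the core of the problem:

n_samples = 600000
n_features = 150000
bytes_per_value = 8  # assuming int64
total_bytes = n_samples * n_features * bytes_per_value
print(total_bytes / 1e9)  # ~720.0 GB as a dense array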
Below is a code snippet that replicates the main behaviour of the actual code:
import numpy
import psutil

FV_length = 150000
X_List = []
Y_List = []

for i in range(0, 600000):
    feature_vector = numpy.zeros((FV_length,), dtype=numpy.int64)
    # using db data, mark the activated features
    class_label = 0

    X_List.append(feature_vector)
    Y_List.append(class_label)

    if (i % 100 == 0):
        print(i)
        print("Virtual mem %s" % (psutil.virtual_memory().percent))
        print("CPU usage %s" % psutil.cpu_percent())

X_Data = numpy.asarray(X_List)
Y_Data = numpy.asarray(Y_List)
The code results in ever-increasing memory allocation, until the process gets killed. Is there a way to reduce this memory growth?
I have tried calling gc.collect(), but it always returns 0. I have also tried explicitly setting variables to None, with no effect.
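Would a sparse representation be the right direction here, given that only a small number of features per row are activated? A minimal sketch of what I have in mind (scipy.sparse.lil_matrix is my assumption, not something from my current code):

import numpy
from scipy import sparse

FV_length = 150000
n_samples = 600000

# build the design matrix row by row in a sparse structure,
# so only the activated feature indices are actually stored
X_sparse = sparse.lil_matrix((n_samples, FV_length), dtype=numpy.int8)
Y_Data = numpy.zeros(n_samples, dtype=numpy.int8)

for i in range(n_samples):
    # using db data, mark the activated features, e.g.
    # X_sparse[i, activated_indices] = 1
    Y_Data[i] = 0  # class label from db

X_Data = X_sparse.tocsr()  # convert once building is done

Or, if the data really has to stay dense, would a disk-backed array like numpy.memmap be a better fit?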