2

I have a dataset of 1.4 million samples x 32 features.

I want to convert each sample into the concatenated array of the 1000 earlier samples plus itself. Since I don't have the earlier data for the first 1000 samples, I remove them. Thus, each sample has 1001*32 features after the conversion. I use the code below, but it crashes every time, even on my 12GB RAM laptop. What am I doing wrong here? How can I make this computation feasible?

import numpy as np

def take_previous_data(X_train, y):
    # Drop the first 1000 samples; they have no 1000-sample history.
    temp_train_data = X_train[1000:]
    temp_labels = y[1000:]
    final_train_set = []
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        # Concatenate the sample with its 1000 predecessors and flatten to 1001*32 features.
        final_train_set.append(X_train[actual_index-1000:actual_index+1].flatten())
    return np.array(final_train_set), temp_labels

Note: Using Python 2.7

Abhishek Bhatia
  • suppose the data type is float, 1400000*1000*32*8/1024/1024/1024 = 333GB – yangjie Aug 23 '15 at 14:24
  • 1
    `crash` is a poor way of describing a problem. Give the error message, and context (stacktrace) where possible. It helps to know exactly where in your code the problem is occurring. Also if the issue appears to be size related, tell us what data sizes do work. – hpaulj Aug 23 '15 at 18:41
  • @hpaulj There is no error. The Python program hits 100% memory usage and my computer freezes. – Abhishek Bhatia Aug 24 '15 at 10:42

2 Answers

2

Remember that when you slice an array it actually returns a copy, so X_train[1000:] and y[1000:] are already expensive. But the most expensive piece is definitely this one: X_train[actual_index-1000:actual_index+1]. I don't know what the exact size of X_train is, but you're copying at least 1000 elements there, and then making yet another copy with flatten().

Something like this would take less memory: with a generator you only have one copy of the thing in memory per iteration, instead of len(X_train) - 1000 copies.

import numpy as np

def train_generator(X_train):
    # Yield one flattened 1001-row window at a time instead of building the whole list.
    for index in xrange(1000, len(X_train)):
        yield X_train[index-1000:index+1].flatten()

def take_previous_data(X_train, y):
    return np.array(train_generator(X_train)), y[1000:]


take_previous_data(['a'*100000000] * 2000, ['b'*100000000] * 2000) # passes easily on my 8GB laptop :)

I don't know what the goal of the code is, but you could also look at the numpy methods for transforming arrays; that would probably be even more efficient.
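
One concrete option along those lines, added here only as an illustrative sketch (it is not part of the original answer): numpy.lib.stride_tricks.as_strided can describe all the overlapping 1001-row windows as a view on the original array, so nothing is copied until you materialise the batch you actually need. The toy array size, window length and batch size below are placeholders, the input is assumed to be a C-contiguous NumPy array, and as_strided does no bounds checking, so the shape and strides have to be computed carefully.

import numpy as np
from numpy.lib.stride_tricks import as_strided

def window_view(X, window=1001):
    # A (n_windows, window, n_features) view of X where row i is X[i:i+window].
    # No data is copied here; as_strided only re-describes the existing buffer.
    n, f = X.shape
    s0, s1 = X.strides
    return as_strided(X, shape=(n - window + 1, window, f), strides=(s0, s0, s1))

X = np.random.rand(5000, 32).astype(np.float32)   # toy stand-in for X_train
windows = window_view(X)                          # cheap: just a view
batch = windows[:64].reshape(64, -1)              # only this 64 x (1001*32) block is copied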

Maresh
  • 1
    Slices like that are views, not copies. `flatten` does return a copy (see its doc). `x.flat` or `x.ravel` use views where possible. – hpaulj Aug 23 '15 at 18:38
  • I did check the doc for flatten(). `numpy.ndarray.flatten ndarray.flatten(order='C') Return a copy of the array collapsed into one dimension.` http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html I don't know about the slices though, standard python would copy. – Maresh Aug 23 '15 at 18:45
  • The original size of X_train is 1,400,000*32, after transform it would be 1,400,000*32032, that's the real problem – yangjie Aug 24 '15 at 01:03
  • @Maresh It returns a generator object. Suppose I want to run a simple PCA on it without exceeding my memory. How can I do that? `clf=PCA(0.98,whiten=True) ; X_train=clf.fit_transform(X_train) ` – Abhishek Bhatia Aug 24 '15 at 10:59
  • Hmm, I'm afraid my answer was irrelevant for numpy arrays, check this: http://stackoverflow.com/questions/367565/how-do-i-build-a-numpy-array-from-a-generator . You kind of need to preallocate the array, and then you'd lose the benefit of using a generator... I guess you should have a look at sparse matrices http://docs.scipy.org/doc/scipy/reference/sparse.html , or figure out a way to do partial computation, but that's beyond my knowledge. – Maresh Aug 24 '15 at 11:59
  • Many SO questions about 'train-sets' involve `scikit-learn` and its use of sparse matrices. – hpaulj Aug 24 '15 at 14:20
  • @hpaulj Can you give an example? – Abhishek Bhatia Aug 24 '15 at 16:33
  • A search: http://stackoverflow.com/search?tab=newest&q=[scikit-learn]%20sparse%20train – hpaulj Aug 24 '15 at 16:43
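
On the PCA question raised in the comments above, a hedged sketch (not from either answer): scikit-learn's IncrementalPCA can be fitted with partial_fit on one batch of windows at a time, so the full 1001*32-wide matrix never has to exist in memory. Note that IncrementalPCA needs a fixed integer n_components rather than the explained-variance fraction used in PCA(0.98); the n_components=50 and batch_size=256 below are arbitrary placeholder choices, and iter_window_batches is a hypothetical helper.

from sklearn.decomposition import IncrementalPCA
import numpy as np

X_train = np.random.rand(5000, 32)   # toy stand-in for the real (1.4M, 32) array

def iter_window_batches(X_train, window=1001, batch_size=256):
    # Yield (batch_size, window*n_features) blocks of flattened windows.
    for start in xrange(window - 1, len(X_train), batch_size):
        stop = min(start + batch_size, len(X_train))
        yield np.vstack([X_train[i - window + 1:i + 1].ravel()
                         for i in xrange(start, stop)])

ipca = IncrementalPCA(n_components=50, whiten=True)
for batch in iter_window_batches(X_train):
    if batch.shape[0] >= ipca.n_components:   # partial_fit needs enough rows per batch
        ipca.partial_fit(batch)

# Transform batch by batch as well; the reduced output (n_windows x 50) fits in RAM.
X_reduced = np.vstack([ipca.transform(b) for b in iter_window_batches(X_train)])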
1

From what I understand, you're trying to increase your data's volume by a factor of about 1001, so unless the original data is under roughly 10-12 MB you're going to end up with more than 12 GB of data.

My suggestion would be to read the bits you need for each individual feature set computation from a file and then write the output to another file.

Using files to hold the data you're not currently operating on should fix your RAM problems.
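
A minimal sketch of that file-backed idea using np.memmap (the file names, dtype and shapes are assumptions for illustration, not from the answer): the input is memory-mapped so only the rows touched by the current window are paged into RAM, and each flattened window is written straight into an output memmap. The output file is still on the order of 170 GB at float32, so in practice you would more likely consume each window, or batch of windows, on the fly instead of materialising all of them.

import numpy as np

n_samples, n_features, window = 1400000, 32, 1001

# Raw float32 dump of X_train on disk; pages are read into RAM only when touched.
X = np.memmap('X_train.dat', dtype=np.float32, mode='r',
              shape=(n_samples, n_features))
# Output file for the widened feature set, written row by row.
out = np.memmap('X_windows.dat', dtype=np.float32, mode='w+',
                shape=(n_samples - window + 1, window * n_features))

for i in xrange(n_samples - window + 1):   # Python 2.7, per the question
    out[i] = X[i:i + window].ravel()       # roughly 125 KB in memory at a time

out.flush()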