
So I have over 30 million objects that I need to use as my training data. My issue is simple: when I build my training array by iteratively appending to it, at a certain size the list becomes too large and Python gets killed. What is a way to get around this? I have been trying to figure this out for hours and keep coming up short!

Code example for creating the training array:

training_array = []
for ...:
    data = ...  # load data from somewhere
    data_array = [x for x in data]  # some large list, 2-3 million objects
    for item in data_array:
        training_array.append(item.a + item.b)

After a while, "killed" is printed to the console and Python exits. How can I avoid this?

A more specific phrasing of the question:

I am trying to train on a very, very large array, but Python gets killed while the array is being built. The training algorithm cannot be trained on chunks of data; it needs one full array, which rules out the only workaround I knew of. Is there another way to create this array without using all my RAM (if that is the actual issue)?

Ryan Saxe
  • *What is a way to get around this?* Make a smaller array. – vaultah Aug 22 '14 at 19:44
  • But in practice people train on much larger data sets than the one I am creating, so there must be a way to train on an array of that size, or to break it up in some regard so that Python can support it – Ryan Saxe Aug 22 '14 at 19:45
  • 1
    See http://stackoverflow.com/a/5537764/154762 – solarc Aug 22 '14 at 19:46
  • Does this help: http://www.stavros.io/posts/optimizing-python-with-cython/ – Rahul Tripathi Aug 22 '14 at 19:47
  • That depends on your hardware and what kind of method you're using to train. One solution to this problem is just to buy more RAM. Or you can look into using some kind of [online trainer](http://en.wikipedia.org/wiki/Online_machine_learning). Also, again depending on the model, you might be able to take advantage of [sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html). – Roger Fan Aug 22 '14 at 19:48
  • I'm not trying to optimize this, just getting it to work...@solarc, that could work, I'll try that – Ryan Saxe Aug 22 '14 at 19:50
  • What sort of objects are you dealing with? If you can build the entries of the training array one at a time and save them to file you might have better luck loading the file in all at once instead of growing the list one item at a time. – Robb Aug 22 '14 at 20:05

2 Answers

  • Is data a Python list? If it is, then

    data_array = [x for x in data]
    

    is unnecessary, since it is the same as saying

    data_array = list(data)
    

    which makes a copy of data. That doubles the amount of memory required, but it is not clear what purpose this serves.

    Also note that you can del data to allow Python to reclaim the memory used by data when it is no longer needed.
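
    For example, a minimal illustration (how the data gets loaded is left elided, as in your code):

    data = ...  # load data from somewhere
    training_array.extend(item.a + item.b for item in data)
    del data  # drop the reference so Python can reclaim this chunk's memory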

  • On the other hand, perhaps data is an iterator. If that's the case, then you could save memory by avoiding the creation of the Python list, data_array. In particular, you don't need data_array to define training_array. You could replace

    data_array = [x for x in data] #some large array, 2-3 million objects  
    for item in data_array:
        training_array.append(item.a + item.b)
    

    with the list comprehension

    training_array = [x.a + x.b for x in data]
    
  • If you are using NumPy and ultimately want training_array to be a NumPy array, then you can save even more memory by avoiding the creation of the intermediate Python list entirely: define the NumPy array, training_array, directly from data:

    training_array = np.fromiter((x.a + x.b for x in data),
                                 dtype=...)
    

    Note that (x.a + x.b for x in data) is a generator expression, thus avoiding the much larger amount of memory required had we used a list comprehension here.

    If you know the length of data, adding count=... to the call to np.fromiter will speed up its performance, since it will allow NumPy to pre-allocate the right amount of memory for the final array.

    You'll also have to specify the correct dtype. If the values in training_array are floats, you can save memory (at the expense of precision) by specifying a dtype with a smaller itemsize. For example, dtype='float32' stores each float in the array using 4 bytes (i.e. 32 bits). Normally NumPy uses float64, which stores each float in 8 bytes. So you can create a smaller array (and thus save memory) by using a smaller dtype.
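
    Putting those last two points together, a minimal sketch (the count of 30 million and the float32 dtype are assumptions about your data):

    import numpy as np

    n = 30000000  # assumed: the total number of items is known in advance
    training_array = np.fromiter((x.a + x.b for x in data),
                                 dtype='float32',  # 4-byte floats instead of the default 8-byte float64
                                 count=n)          # lets NumPy allocate the final array once, up front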

  • If you are still running short of memory, then you could use np.memmap to create a file-based array instead of a memory-based array. Other options in the same vein include using h5py or pytables to create an HDF5 file.
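
    A hedged sketch of the np.memmap route, assuming the total length is known in advance and float32 precision is acceptable; the filename is only an example:

    import numpy as np

    n = 30000000  # assumed total number of training values
    training_array = np.memmap('training.dat', dtype='float32',
                               mode='w+', shape=(n,))

    for i, x in enumerate(data):
        training_array[i] = x.a + x.b  # backed by the file on disk, not a plain in-memory list
    training_array.flush()             # make sure buffered writes reach the file
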
unutbu

There are a few things you can do:

  1. Work with your data in chunks - This will keep you from having a monster array at any given time, and should help keep overhead down (see the sketch after this list).

  2. Produce your data from a generator - Generators are "lazily" evaluated, meaning they don't exist all at once. Each element is created when you call for it, not sooner, keeping you from having a monster array. Generators can be a bit tricky to figure out if you aren't familiar with them, but there are tons of resources around.
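
A minimal sketch of the chunked approach from point 1; load_chunks and model.partial_fit are hypothetical stand-ins for however you load each batch and whatever incremental training your model supports:

def train_in_chunks(load_chunks, model):
    # Hypothetical sketch: load_chunks() yields one batch of objects at a time,
    # and model is assumed to support some form of incremental fitting.
    for chunk in load_chunks():
        batch = [item.a + item.b for item in chunk]  # only one chunk's worth of data in memory
        model.partial_fit(batch)                     # hypothetical incremental-training call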

For your specific problem, try this generator:

def train_gen(data):
    data_gen = (x for x in data)  #The () here are important as it makes data_gen a generator as well, as opposed to a list
    for item in data_gen:
        yield item.a + item.b


data = ...  # load data from somewhere
training_array = train_gen(data)

for item in training_array:
    # Iterating over training_array produces one value at a time and then
    # discards it, so only one item is in memory at once
    ...
wnnmaw
  • These solutions really depend on what he wants to use the array for though. The majority of models don't handle chunked or observation-by-observation data. – Roger Fan Aug 22 '14 at 19:49
  • I cannot use chunks, please see the edited question. It is not supported by the training model. As far as a generator goes, I would need to convert the generator back into a list and that would simply give me the same error I am getting now (I believe at least) – Ryan Saxe Aug 22 '14 at 19:56