
I'm trying to analyze text, but my Mac only has 8 GB of RAM, and the RidgeRegressor just stops after a while with `Killed: 9`. I reckon this is because it needs more memory.

Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?

lte__
  • Have a look at this question: https://stackoverflow.com/questions/17710748/process-large-data-in-python – nalyd88 Sep 02 '17 at 14:12
  • Possible duplicate of [Process large data in python](https://stackoverflow.com/questions/17710748/process-large-data-in-python) – nalyd88 Sep 02 '17 at 14:13

1 Answer


You will need to do it manually.

There are probably two distinct core problems here:

  • A: holding your training data
  • B: training the regressor

For A, you can try numpy's memmap, which abstracts swapping away. Alternatively, consider converting your data to HDF5 or some database. For HDF5, you can use h5py or pytables, both of which allow numpy-like usage.
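A minimal sketch of the memmap route (the file name, dtype, and shapes are assumptions; adapt them to your data):

```python
import numpy as np

n_samples, n_features = 10_000_000, 20  # assumed shapes

# Create a disk-backed array that behaves like a normal ndarray;
# the OS pages data in and out, so it never has to fit in RAM.
X = np.memmap('X.dat', dtype='float32', mode='w+',
              shape=(n_samples, n_features))
X[:1000] = np.random.rand(1000, n_features)  # fill it in chunks
X.flush()

# Later: reopen read-only; only the pages you touch get loaded.
X = np.memmap('X.dat', dtype='float32', mode='r',
              shape=(n_samples, n_features))
print(X[:5].mean())
```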

For B: it's a good idea to use an out-of-core-ready algorithm. In scikit-learn, those are the estimators supporting partial_fit (a manual training loop is sketched further below).

Keep in mind that this training process decomposes into at least two new concerns:

  • Memory efficiency
    • Swapping is slow; you don't want an algorithm that holds O(N^2) auxiliary memory during learning
  • Efficient convergence

The partial_fit-capable algorithms mentioned above should be okay on both counts.

SGDRegressor can be parameterized to resemble Ridge regression: squared loss with an L2 penalty.
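A sketch of that parameterization (the alpha value is only an example; SGDRegressor averages the loss over samples, so its alpha corresponds roughly to Ridge's alpha divided by n_samples, depending on the scikit-learn version):

```python
from sklearn.linear_model import SGDRegressor

# Squared loss (the default) + L2 penalty is the Ridge objective,
# optimized incrementally instead of with a batch solver.
reg = SGDRegressor(penalty='l2', alpha=1e-4)
```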

Also: you might need to call partial_fit manually, obeying the rules of the algorithm (convergence proofs often require some kind of random ordering of the samples). The problem with abstracting swapping away is: if your regressor permutes the whole dataset in each epoch, without knowing how costly that is, you might be in trouble!
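A sketch of such a manual partial_fit loop over memmapped data, shuffling the chunk order each epoch (file names, shapes, and chunk size are assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

n_samples, n_features, chunk = 1_000_000, 20, 10_000  # assumed
X = np.memmap('X.dat', dtype='float32', mode='r',
              shape=(n_samples, n_features))
y = np.memmap('y.dat', dtype='float32', mode='r', shape=(n_samples,))

reg = SGDRegressor(penalty='l2', alpha=1e-4)
n_chunks = n_samples // chunk

for epoch in range(5):
    # Visit chunks in a fresh random order every epoch; shuffling
    # chunk indices is cheap, unlike permuting the data on disk.
    for i in np.random.permutation(n_chunks):
        sl = slice(i * chunk, (i + 1) * chunk)
        reg.partial_fit(np.asarray(X[sl]), np.asarray(y[sl]))
```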

Because the problem itself is quite hard, there are special libraries built for this, while sklearn, as explained, needs some more manual work. One of the most extreme ones (with a lot of crazy tricks) might be vowpal_wabbit, where IO is often the bottleneck! Of course there are other popular libraries like pyspark, which serve a slightly different purpose (distributed computing).

sascha