
My code runs fine with smaller test samples, e.g. 10,000 rows of data in X_train and y_train, but when I run it on millions of rows I get the error below. Is the bug in a package, or is there something I can do differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I have put pool.py from Anaconda's multiprocessing package and parallel.py from scikit-learn's externals package on my Dropbox for you.

The test script is:

import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp


def main():
    print("Started.")

    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    n_samples = 1000000
    n_features = 1000

    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)

    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha': 10.0 ** -np.arange(1, 7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)

if __name__ == '__main__':
    mp.freeze_support()
    main()

This results in the output:

Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written

Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor:  Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
(the three lines above are repeated once for each of the 8 worker processes)
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code

EDIT: ogrisel's answer does work with manual memory mapping under scikit-learn 0.15.0b1. Don't forget to run only one such script at a time, otherwise you can still run out of memory and spawn too many threads. (My run takes ~60 GB of RAM on data that is ~12.5 GB as CSV, with 8 threads.)

László
  • can you post the version of Anaconda you are using? If possible, post the parallel.py and pool.py. It may be that you are creating too many parallel processes. Anyway, this kind of thing is more suitable as a bug report. – eswarp25 Jun 25 '14 at 11:19
  • @eswarp25 Thanks, it would be great if I could get it to work, and not just file a bug report. I am using Anaconda 2.0.1, and I can post the .py files, but how exactly? Link to the files in my own Dropbox? – László Jun 25 '14 at 11:31
  • can you post the version of python and scikit-learn you are using? It would be helpful to look at the code for that particular version in the official repositories. You would not need to post the files then. – eswarp25 Jun 25 '14 at 11:33
  • @eswarp25 Of course. I am using Python 2.7.7. The only version number I see relevant to parallel.py is scikit-learn-0.15.0b1-np18py27_0. For pool.py, I see no other relevant version number than this is Anaconda 2.0.1 for Win64. I added more info to the question too. – László Jun 25 '14 at 11:37
  • This sounds like a serious bug in joblib; however, I cannot reproduce it by running the following script: https://gist.github.com/ogrisel/dab225fc9d0a365119b6 Can you please provide a standalone script with random data that can trigger the bug on your machine? Which numpy version do you have? – ogrisel Jun 25 '14 at 13:39
  • @ogrisel I will try with random data and let you know. I am using numpy 1.8.1 from http://repo.continuum.io/pkgs/pro/win-64/ – László Jun 25 '14 at 13:42
  • As you are on windows, don't forget to put an `if __name__ == '__main__':` block around the section that calls into scikit-learn with `n_jobs=8`. – ogrisel Jun 25 '14 at 13:53
  • FYI I could just run the code from my previous gist on a big windows box without a crash. – ogrisel Jun 25 '14 at 13:54
  • @ogrisel I just did, after crashing first. I learnt this lesson a few days ago: http://stackoverflow.com/a/24374798/938408 – László Jun 25 '14 at 14:04
  • @ogrisel Same here with your example. I'll try more observations. Otherwise it is hard to fix if it only misbehaves with my data… – László Jun 25 '14 at 14:05
  • @ogrisel OK, it quickly crashed with ten times as much data. Can you replicate that? Now it starts with the ValueError (10**9 requested, 268435456 written), and then the same second error block from above. – László Jun 25 '14 at 14:10
  • I don't have enough memory on my box to do that. Can you set `verbose=100` and update your question with the new script you used and the error message you get? – ogrisel Jun 25 '14 at 14:20
  • Actually please open an issue on sklearn's tracker at http://github.com/scikit-learn/scikit-learn/issues – ogrisel Jun 25 '14 at 14:21
  • @ogrisel Sure. Do you see any chance for a quick fix? Anything I can do? How much slower is 0.14 if I downgrade? (I heard about the orders of magnitude speed-up.) – László Jun 25 '14 at 14:24
  • Do you see any chance for a quick fix? => not sure at all. How much slower is 0.14 if I downgrade? (I heard about the orders of magnitude speed-up.) => It depends on the model; in your case probably not much. – ogrisel Jun 25 '14 at 14:34
  • @ogrisel By the way, were these parts touched in the 0.15 update at all? I am not sure why I supposed that the downgrade is worth trying. Thanks again for everything. – László Jun 25 '14 at 14:48
  • Yes, the automatic memory-mapping was introduced in 0.15.0b1. – ogrisel Jun 25 '14 at 18:56
  • @ogrisel Because of this, the same code does not run under scikit-learn 0.14. I am unsure how much extra work it would be to recode. `File "…\grid_search.py", line 707, in fit return self._fit(X, y, ParameterGrid(self.param_grid)) File "…\grid_search.py", line 493, in _fit for parameters in parameter_iterable File "…\externals\joblib\parallel.py", line 519, in __call__ self.retrieve() File "…\externals\joblib\parallel.py", line 419, in retrieve self._output.append(job.get()) File "…pool.py", line 558, in get raise self._value SystemError: NULL result without error in PyObject_Call` – László Jun 26 '14 at 09:11
  • It crashes because the worker got killed by the operating system for using too much memory: in 0.14 the full dataset is replicated in memory once for each worker. 0.15 avoids that replication by trying to automatically memory map the data. This automatic memory mapping crashes on your system, hence you should try to do it manually as I replied. – ogrisel Jun 26 '14 at 09:34

1 Answer


As a workaround, you can try to memory map your data explicitly and manually, as explained in the joblib documentation.

Edit #1: Here is the important part:

from sklearn.externals import joblib

joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmap'ed data to GridSearchCV under scikit-learn 0.15+.
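
For instance, under the same setup as the script in the question, a minimal end-to-end sketch of the workaround could look like the following (the file name 'X_train.joblib' and the smaller array shapes are placeholders for illustration only):

import numpy as np
from sklearn.externals import joblib
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search


def main():
    # smaller shapes than in the question, purely for illustration
    X_train = np.random.randn(100000, 100)
    y_train = np.random.randint(0, 2, size=100000)

    # dump the features to disk and reload them as a memory map so the
    # worker processes share one on-disk copy instead of pickling the array
    joblib.dump(X_train, 'X_train.joblib')  # hypothetical file name
    X_train = joblib.load('X_train.joblib', mmap_mode='r+')

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha': 10.0 ** -np.arange(1, 7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)


if __name__ == '__main__':
    main()  # the guard matters on Windows whenever n_jobs > 1

Because the array is reloaded as a numpy.memmap, the workers spawned by n_jobs=8 read the data from disk instead of each receiving a pickled copy of the full array.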

Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process is limited to 2 GB of address space, which can also make it run out of memory.
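
If in doubt, here is a quick way to check which interpreter you are running (this snippet is not from the original answer, just a convenience check):

import struct
import sys

print(sys.version)               # full interpreter description, including the build
print(struct.calcsize("P") * 8)  # prints 64 on a 64-bit Python, 32 on a 32-bit one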

I just found a bug in numpy.save under Python 3.4, but even with that fixed the subsequent call to mmap will fail with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit builds of numpy / scipy / scikit-learn==0.15.0b1).

Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently joblib.Parallel memory maps the input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file, sometimes triggering "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
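
Until that default changes, a small sketch of how one could opt into the read-only mode manually when preparing the data (the stand-in array and file name are for illustration only):

import numpy as np
from sklearn.externals import joblib

X_train = np.random.randn(1000, 10)       # stand-in data, just for illustration

joblib.dump(X_train, 'X_train.joblib')    # hypothetical file name
# mmap_mode='r' opens the array read-only, which avoids the copy-on-write
# ('c') behaviour that can exhaust the Windows paging file
X_train = joblib.load('X_train.joblib', mmap_mode='r')

Read-only access is sufficient here because the estimators only need to read the training data, not modify it.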

ogrisel
  • @ogrisel This solves the problem in the sense that the original crash is circumvented. However, the system ground to a halt, and when I could finally bring a Resource Monitor up, I saw a trivial amount of RAM usage but very heavy HDD usage. Soon after that I got a new MemoryError that I am adding to the question, though maybe it would be worth another thread. Any thoughts on this? Thanks for your time! – László Jun 26 '14 at 11:51
  • I edited my answer to point out that you should not use the 32-bit version of Anaconda. – ogrisel Jun 26 '14 at 12:13
  • @ogrisel A quick note: the problem arises with other estimators too, not only GridSearchCV (namely ElasticNetCV), but the workaround works for those as well. – László Jun 30 '14 at 11:06
  • I did some further benchmarking and it seems that Windows' handling of mmap makes heavy use of the paging file for some reason (this was tested in a Rackspace VM as I don't have a large windows server at hand). Linux seems to be much more efficient with large mmap'ed data on similar hardware. – ogrisel Jun 30 '14 at 13:15
  • The problem under Windows stemmed from the use of `mmap_mode='c'` (copy on write). With `mmap_mode='r'` it works much better (without paging in the background). I changed the default in joblib master. – ogrisel Jun 30 '14 at 20:31