
For some reason, RandomForestClassifier.fit from sklearn.ensemble uses only 2.5 GB of RAM on my local machine but almost 7 GB on my server, with exactly the same training set.

The relevant code, minus the data loading, is pretty much this:

from sklearn.ensemble import RandomForestClassifier

# data_train and data_test are pandas DataFrames loaded earlier
y_train = data_train['train_column']
x_train = data_train.drop('train_column', axis=1)

# The difference in memory consumption starts here
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(x_train, y_train)
preds = clf.predict(data_test)
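
To compare the two machines like-for-like, the peak RAM of the fit call can be measured the same way on both; this is a minimal sketch assuming the third-party memory_profiler package is installed (it is not part of the code above):

from memory_profiler import memory_usage

# Sample the process RSS while fit runs and keep the maximum, in MiB.
peak = memory_usage((clf.fit, (x_train, y_train)), interval=0.5, max_usage=True)
# Depending on the memory_profiler version, max_usage=True returns either a
# float or a one-element list, so normalize before printing.
peak_mib = peak[0] if isinstance(peak, (list, tuple)) else peak
print("Peak memory during fit: %.1f MiB" % peak_mib)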

My local machine is a MacBook Pro with 16 GB of memory and a 4-core CPU. My server is an Ubuntu server on the DigitalOcean cloud with 8 GB of memory and a 4-core CPU as well.

The scikit-learn version is 0.18 and the Python version is 3.5.2.

I can't even imagine possible reasons; any help would be much appreciated.

UPDATE

The MemoryError appears in this code inside the fit method:

# Parallel loop: we use the threading backend as the Cython code
# for fitting the trees is internally releasing the Python GIL
# making threading always more efficient than multiprocessing in
# that case.
trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                 backend="threading")(
    delayed(_parallel_build_trees)(
        t, self, X, y, sample_weight, i, len(trees),
        verbose=self.verbose, class_weight=self.class_weight)
    for i, t in enumerate(trees))
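
As an aside, the tree builders behind this loop work on float32 data internally, so when X arrives as float64 the fit call makes one extra converting copy of it. A hedged sketch of avoiding that copy (not part of the original code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Passing a C-contiguous float32 array up front avoids the float64 -> float32
# converting copy that fit would otherwise make internally.
x_train32 = np.ascontiguousarray(x_train, dtype=np.float32)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(x_train32, y_train)

This alone would not explain a 2.5 GB vs 7 GB gap, but it removes one avoidable allocation when comparing the two machines.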

UPDATE 2

Information about my systems:

# local
Darwin-16.1.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 05:05:28)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18

# server
Linux-3.13.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.1 (default, Dec 18 2015, 00:00:00)
[GCC 4.8.4]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
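
These listings can be reproduced with a short script along these lines (the exact script used is not shown in the question):

import platform
import sys

import numpy
import scipy
import sklearn

# Print the platform string, the Python build, and the library versions
# in the same shape as the listings above.
print(platform.platform())
print("Python %s" % sys.version)
print("NumPy %s" % numpy.__version__)
print("SciPy %s" % scipy.__version__)
print("Scikit-Learn %s" % sklearn.__version__)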

Also my numpy configs:

# server
>>> np.__config__.show()
blas_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
blas_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c


# local
>>> np.__config__.show()
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blas_mkl_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3']
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE

The repr of the clf object is the same on both machines:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)
valignatev
  • One thing that comes to mind is that you probably have very different BLAS, LAPACK, and other C libraries installed on your Macbook compared to your VPS. Falling back on a different algorithm (or even a different implementation of the same algorithm) to compute something used when fitting a model to your data could certainly explain the memory blow up. The first place to look is at your [numpy information](http://stackoverflow.com/a/9002656/2503352) on the respective systems. – gbe Oct 27 '16 at 20:48
  • Hm, my numpy is the same on both machines. The only significant difference I see is GCC. – valignatev Oct 27 '16 at 20:56
  • It's not the version of numpy itself, but the libraries numpy is compiled against that's likely to be at play here. It may also be that difference in GCC versions. – gbe Oct 27 '16 at 21:02
  • I got it, just updated the question with it – valignatev Oct 27 '16 at 21:02
  • The big difference I see is that your Ubuntu libraries weren't compiled against SSE3 and your Apple libraries are compiled against a couple of their libraries. SSE3 causing memory usage blow up seems dubious to me, though I'm not an expert in this field. The Apple libraries I'm not familiar with, but I could see being responsible for the problem. – gbe Oct 27 '16 at 21:12
  • The thing is that RAM usage looks fine on the Mac and blows up only on the server. Maybe I should try to rebuild numpy with the same libraries as on the Mac, if that's even possible. They could be proprietary libraries. – valignatev Oct 27 '16 at 21:15
  • Could you print your RandomForestClassifier object on both machines before calling fit? – lejlot Oct 27 '16 at 21:21
  • updated the question – valignatev Oct 27 '16 at 21:28

3 Answers


One possible explanation is that your server is running an older scikit-learn. There was an issue a while ago where the sklearn random forest was extremely memory-hungry; it was fixed in 0.17 if I recall correctly.

lejlot
  • Thanks for your help, but I use the same versions of scikit-learn, scipy and numpy. I'll update my question with this info – valignatev Oct 27 '16 at 20:55

Well, the issue magically went away after I updated the kernel from 3.13.0-57 to 4.4.0-28. Now it uses even less memory than on my local Mac laptop.

valignatev

I'm not sure this is the reason, but OS X has memory compression enabled by default, while on Linux zRam / zswap / zcache are optional and not enabled by default (see https://en.wikipedia.org/wiki/Virtual_memory_compression).
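
To check on the server whether any of these is actually active, kernels built with zswap expose its state under sysfs; this is only an illustrative sketch, and the path may be absent on an older or differently configured kernel such as 3.13:

from pathlib import Path

# On kernels built with zswap support this file holds "Y" or "N";
# it may not exist at all, in which case nothing can be concluded from it.
flag = Path("/sys/module/zswap/parameters/enabled")
if flag.exists():
    print("zswap enabled:", flag.read_text().strip())
else:
    print("zswap parameter not exposed by this kernel")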

Mikhail Korobov