For some reason, RandomForestClassifier.fit from sklearn.ensemble uses only 2.5 GB of RAM on my local machine but almost 7 GB on my server with exactly the same training set.
The code without imports is pretty much this:
y_train = data_train['train_column']
x_train = data_train.drop('train_column', axis=1)
# The difference in memory consumption starts here
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(x_train, y_train)
preds = clf.predict(data_test)
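For reference, the memory figures can be compared on both machines by reading the peak resident set size of the process after the fit call. Below is a minimal sketch using only the standard library, not the exact way the numbers above were obtained. The helper names peak_rss_mb and run_and_report are made up for illustration. One detail matters when comparing a Mac with a Linux box: ru_maxrss is reported in bytes on macOS and in kilobytes on Linux, so the raw values are not directly comparable.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    if sys.platform == "darwin":
        return peak / (1024.0 ** 2)
    return peak / 1024.0


def run_and_report(label, func, *args, **kwargs):
    """Call func and print the process-wide peak RSS reached so far.

    The value is a high-water mark for the whole process, not the
    amount allocated by this single call.
    """
    result = func(*args, **kwargs)
    peak = peak_rss_mb()
    print("{}: peak RSS so far {:.1f} MB".format(label, peak))
    return result, peak
```

With the code above this would be used as `clf, _ = run_and_report("fit", clf.fit, x_train, y_train)`, once right after loading the data and once after fitting, so the two readings can be compared between machines.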
My local machine is a MacBook Pro with 16 GB of memory and a 4-core CPU. My server is an Ubuntu server on the DigitalOcean cloud with 8 GB of memory and also a 4-core CPU.
The sklearn version is 0.18 and the Python version is 3.5.2.
I can't even imagine the possible reasons; any help would be appreciated.
UPDATE
A MemoryError occurs in this code inside the fit method:
# Parallel loop: we use the threading backend as the Cython code
# for fitting the trees is internally releasing the Python GIL
# making threading always more efficient than multiprocessing in
# that case.
trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                 backend="threading")(
    delayed(_parallel_build_trees)(
        t, self, X, y, sample_weight, i, len(trees),
        verbose=self.verbose, class_weight=self.class_weight)
    for i, t in enumerate(trees))
UPDATE 2
Information about my systems:
# local
Darwin-16.1.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 05:05:28)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
# server
Linux-3.13.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.1 (default, Dec 18 2015, 00:00:00)
[GCC 4.8.4]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
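A listing like the one above can be produced with a few lines of Python. This is a sketch rather than the exact commands I ran. The function name system_report is hypothetical, and a package that is not installed is reported as such instead of raising an error.

```python
import importlib
import platform
import sys


def system_report(packages=(("NumPy", "numpy"),
                            ("SciPy", "scipy"),
                            ("Scikit-Learn", "sklearn"))):
    """Return platform, Python and package versions as a list of lines."""
    lines = [platform.platform(),
             "Python " + sys.version.replace("\n", " ")]
    for label, module_name in packages:
        try:
            module = importlib.import_module(module_name)
            version = getattr(module, "__version__", "unknown")
        except ImportError:
            # Report a missing package instead of failing.
            version = "not installed"
        lines.append("{} {}".format(label, version))
    return lines


if __name__ == "__main__":
    print("\n".join(system_report()))
```

Running the same script on both machines makes version differences easy to diff line by line.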
Also, my NumPy configs:
# server
>>> np.__config__.show()
blas_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
blas_mkl_info:
    NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    define_macros = [('HAVE_CBLAS', None)]
    library_dirs = ['/usr/local/lib']
    language = c
# local
>>> np.__config__.show()
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blas_mkl_info:
    NOT AVAILABLE
atlas_threads_info:
    NOT AVAILABLE
lapack_mkl_info:
    NOT AVAILABLE
openblas_lapack_info:
    NOT AVAILABLE
atlas_info:
    NOT AVAILABLE
atlas_3_10_blas_info:
    NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3']
openblas_info:
    NOT AVAILABLE
atlas_3_10_blas_threads_info:
    NOT AVAILABLE
atlas_3_10_threads_info:
    NOT AVAILABLE
atlas_3_10_info:
    NOT AVAILABLE
atlas_blas_threads_info:
    NOT AVAILABLE
atlas_blas_info:
    NOT AVAILABLE
The repr of the clf object is the same on both machines:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)