Description
I am running Python code that uses a scikit-learn machine learning algorithm on an input table close to 100GB in size. Please see the error message below:
```
Loading data from /XIVData/eecteam/casData/imsimDatalabel_10M.csv
Loading completed in 1560.0 seconds
##################
DecisionTreeRegressor
##################
Start time = Mon Oct 30 07:35:48 2017
Traceback (most recent call last):
  File "/home/sasdemo/python/pipeline.py", line 130, in <module>
    min_samples_leaf=min_leaf_size), 'y')
  File "/home/sasdemo/python/pipeline.py", line 74, in train_model
    model.fit(X_train[predictors], X_train[target])
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py", line 1029, in fit
    X_idx_sorted=X_idx_sorted)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py", line 122, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError
```
The code reads a CSV input file into a pandas DataFrame. When the following statement invokes the modeling algorithm, scikit-learn internally tries to create a NumPy array from the DataFrame, and it is this copy that fails with the memory error:

```python
model.fit(X_train[predictors], X_train[target])
```
The same failure occurs with every machine learning algorithm I try; it is not specific to DecisionTreeRegressor.
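To check my understanding of the mechanism, here is a minimal sketch of a workaround I have been testing (the column names `feat1`/`feat2` and the `min_samples_leaf` value are made up for illustration; only the file path and the `y` target come from my pipeline). Since the traceback shows the tree code converting X via `check_array(X, dtype=DTYPE, ...)`, where DTYPE is float32, loading the data as float32 up front should roughly halve the DataFrame footprint and avoid the dtype-conversion copy inside `fit()`:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# A default read_csv produces float64 columns; fit() then copies them
# into a fresh float32 array, so both live in memory at that moment.
# Reading as float32 up front avoids that conversion copy.
# NOTE: feature column names below are hypothetical.
df = pd.read_csv("/XIVData/eecteam/casData/imsimDatalabel_10M.csv",
                 dtype=np.float32)
X = df[["feat1", "feat2"]].values  # already float32, no dtype conversion needed
y = df["y"].values

model = DecisionTreeRegressor(min_samples_leaf=5)
model.fit(X, y)
```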
Would request experts to help me solidify my understanding of the issue. Any additional thoughts, recommendations, or references would be appreciated.
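One direction I am considering, sketched below under the assumption that the predictors are numeric (again with hypothetical column names): estimators that support `partial_fit` can be trained out of core, one CSV chunk at a time, so the full 100GB table never has to be resident in memory. Tree models do not offer `partial_fit`, so this sketch swaps in a linear SGDRegressor.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor

predictors = ["feat1", "feat2"]  # hypothetical column names
target = "y"

model = SGDRegressor()
# Stream the CSV in fixed-size chunks instead of loading it whole.
for chunk in pd.read_csv("/XIVData/eecteam/casData/imsimDatalabel_10M.csv",
                         dtype=np.float32, chunksize=1000000):
    model.partial_fit(chunk[predictors].values, chunk[target].values)
```

I realize SGD would also need feature scaling to work well; the sketch is only meant to illustrate the chunked training pattern.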
Python/Anaconda version used -- Python 2.7.13 :: Anaconda 4.3.1 (64-bit)