
Description

I am running Python code that uses a scikit-learn machine learning algorithm, and the input table is close to 100 GB in size. Please see the error message below:

Loading data from /XIVData/eecteam/casData/imsimDatalabel_10M.csv
Loading completed in 1560.0 seconds
##################
DecisionTreeRegressor
##################
Start time = Mon Oct 30 07:35:48 2017

Traceback (most recent call last):
  File "/home/sasdemo/python/pipeline.py", line 130, in
    min_samples_leaf=min_leaf_size), 'y')
  File "/home/sasdemo/python/pipeline.py", line 74, in train_model
    model.fit(X_train[predictors], X_train[target])
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py", line 1029, in fit
    X_idx_sorted=X_idx_sorted)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py", line 122, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The code reads a CSV input file into a pandas DataFrame. When the model is fitted with the statement below, scikit-learn internally converts the DataFrame to a NumPy array, and that copy of the data is what fails with a MemoryError.

model.fit(X_train[predictors], X_train[target])

This issue occurs regardless of which machine learning algorithm I use.
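For reference, a minimal sketch of the relevant part of pipeline.py (the column names, feature list, and min_leaf_size value below are placeholders, not the exact values from my script):

# Minimal sketch of pipeline.py; names and values are placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

csv_path = '/XIVData/eecteam/casData/imsimDatalabel_10M.csv'
X_train = pd.read_csv(csv_path)          # ~100 GB CSV read into a DataFrame

target = 'y'
predictors = [c for c in X_train.columns if c != target]
min_leaf_size = 5                         # placeholder value

model = DecisionTreeRegressor(min_samples_leaf=min_leaf_size)
model.fit(X_train[predictors], X_train[target])   # MemoryError raised here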

I would request experts to help me solidify my understanding of the issue. Any additional thoughts, recommendations, or references would be appreciated.

Python/Anaconda version used -- Python 2.7.13 :: Anaconda 4.3.1 (64-bit)

  • Do you have more than 100GB RAM or why would you expect to be able to work with such a big file? – MB-F Nov 07 '17 at 12:17
  • I have 384GB RAM available on the system. – Rajesh Naidu Nov 08 '17 at 09:26
  • Ok, so the file fits into memory but you need to avoid temporary copies (which is not always possible with off-the-shelf functions). I would try to avoid Pandas and load the data directly into a numpy array ([see here](https://stackoverflow.com/questions/3518778/how-to-read-csv-into-record-array-in-numpy)). Also pay attention to data types. If the file contains only small integers and something converts them to float it may multiply the memory consumption. – MB-F Nov 08 '17 at 09:58 (a sketch of this approach appears after these comments)
  • Thanks for the inputs. I will try this and see if it works – Rajesh Naidu Nov 08 '17 at 13:03
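A rough sketch of the approach MB-F suggests above, assuming the file is comma-delimited, all-numeric, has a single header row, and the target sits in the last column (all of these are assumptions about the data, and the leaf size is a placeholder):

# Load the CSV straight into a float32 NumPy array, bypassing pandas.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

csv_path = '/XIVData/eecteam/casData/imsimDatalabel_10M.csv'

# float32 uses half the memory of the default float64 and matches the dtype
# scikit-learn's tree code converts to anyway, so no extra conversion copy.
data = np.genfromtxt(csv_path, delimiter=',', skip_header=1, dtype=np.float32)

X = data[:, :-1]   # feature columns (a view, not a copy)
y = data[:, -1]    # target column, assumed to be the last one

model = DecisionTreeRegressor(min_samples_leaf=5)   # placeholder leaf size
model.fit(X, y)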

0 Answers