I have a largeish DataFrame, loaded from a csv file (about 300MB).

From this, I'm extracting a few dozen features to use in a RandomForestClassifier: some of the features are simply derived from columns in the data, for example:

 feature1 = data["SomeColumn"].apply(len)
 feature2 = data["AnotherColumn"]

And others are created as new DataFrames from numpy arrays, using the index of the original DataFrame:

feature3 = pandas.DataFrame(count_array, index=data.index)

All these features are then joined into one DataFrame:

features = feature1.join(feature2) # etc...
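
For reference, a minimal sketch of an equivalent one-step combination (assuming pandas.concat, which aligns everything on the shared index; the exact set of features is just illustrative):

# combine the individual feature columns into a single DataFrame,
# aligning them on the shared index
features = pandas.concat([feature1, feature2, feature3], axis=1)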

And I train a random forest classifier:

classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=n_jobs,
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])

The RandomForestClassifier works fine with these features: building a tree takes on the order of hundreds of megabytes of memory. However, if after loading my data I take a small subset of it:

data_slice = data[data['somecolumn'] > value]

Then building a tree for my random forest suddenly takes many gigabytes of memory, even though the features DataFrame is now only about 10% of the size of the original.

I can believe that this might be because a sliced view on the data doesn't permit further slices to be done efficiently (though I don't see how this could propagate into the features array), so I've tried:

data = pandas.DataFrame(data_slice, copy=True)

but this doesn't help.
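
For completeness, the other kinds of "compacting" I have in mind (illustrative sketches only, not something I've confirmed to help):

import numpy

# force a fresh copy of the sliced frame and of the derived features
data_slice = data_slice.copy()
features = features.copy()

# or bypass pandas entirely and hand the classifier a plain float64 array
X = numpy.ascontiguousarray(features.values, dtype=numpy.float64)
classifier.fit(X, data_slice["TargetColumn"].values)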

  • Why would taking a subset of the data massively increase memory use?
  • Is there some other way to compact / rearrange a DataFrame which might make things more efficient again?

1 Answer

The RandomForestClassifier is copying the dataset several times in memory, especially when n_jobs is large. We are aware of these issues and fixing them is a priority:

  • I am currently working on a subclass of the multiprocessing.Pool class of the standard library that will avoid memory copies when numpy.memmap instances are passed to the subprocess workers. This will make it possible to share the memory of the source dataset, plus some precomputed data structures, between the workers. Once this is done I will close the corresponding issue on the github tracker (a rough sketch of the idea is given at the end of this answer).

  • There is an ongoing refactoring that will further decrease the memory usage of RandomForestClassifier by a factor of two. However, the current state of the refactoring is twice as slow as master, so further work is still required.

However, none of those fixes will make it into the 0.12 release, which is scheduled for next week. Most probably they will be done for 0.13 (planned for release in 3 to 4 months), but of course they will be available in the master branch a lot sooner.
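
To make the memmap idea from the first bullet concrete, here is a rough sketch only (not the actual scikit-learn / joblib implementation; the file name is made up):

import numpy as np

# dump the feature matrix to disk once...
X = np.asarray(features, dtype=np.float64)
X.tofile("features.dat")

# ...then reopen it as a memory-mapped array: worker processes that receive
# this object can share the same pages instead of getting pickled copies
X_shared = np.memmap("features.dat", dtype=np.float64, mode="r", shape=X.shape)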

  • Is there any particular reason the memory usage should increase so dramatically (at least tenfold) when I use only a slice of the data though? `data = data[data['col'] > x]`. Otherwise the memory use is manageable. – James Sep 01 '12 at 14:53
  • You mean you see a tenfold increase without even calling `RandomForestClassifier`? Have you compared with `n_jobs=1`? Also, please note that `RandomForestClassifier` will copy the input to make it a homogeneous Fortran-aligned numpy array with `dtype=np.float64`, so that might be another cause of hidden yet expected memory usage. Install psutil and http://pypi.python.org/pypi/memory_profiler to find the culprit for your case (a minimal usage sketch follows these comments). – ogrisel Sep 01 '12 at 16:08
  • No, the huge increase happens while fitting the classifier, but only if the data being used was sliced (by a boolean index as described); the data provides (some of) the features and the target values. It's definitely more than a single extra copy; I'll investigate with the profiler. – James Sep 01 '12 at 17:24
  • Investigate first with `n_jobs=1` to avoid having to deal with multiprocessing issues that I am currently working on. If the other source of memory allocation is `X_argsorted` then this is the second known issue we are working on. If you find additional sources of memory inefficiency please feel free to report them on the mailing list or as new github issues. – ogrisel Sep 01 '12 at 17:27
  • Okay, looks like it's only an issue with n_jobs > 1. – James Sep 02 '12 at 14:28
  • OK, so this is the problem of memory copies between Python processes. The shared-memory tooling I am working on will help fix that. – ogrisel Sep 02 '12 at 18:45
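
A minimal sketch of the profiling approach suggested in the comments above (assumed usage of the memory_profiler package; the wrapper function is made up for illustration):

from memory_profiler import profile
from sklearn.ensemble import RandomForestClassifier

@profile  # prints a line-by-line report of memory usage when the function runs
def fit_forest(features, target):
    classifier = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=0)
    classifier.fit(features, target)
    return classifier

fit_forest(features, data_slice["TargetColumn"])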