I have a largeish DataFrame, loaded from a csv file (about 300MB).
From this, I'm extracting a few dozen features to use in a RandomForestClassifier: some of the features are simply derived from columns in the data, for example:
feature1 = data["SomeColumn"].apply(len)
feature2 = data["AnotherColumn"]
And others are created as new DataFrames from numpy arrays, using the index on the original dataframe:
feature3 = pandas.DataFrame(count_array, index=data.index)
All these features are then joined into one DataFrame:
features = feature1.join(feature2) # etc...
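Spelled out, the join ends up looking roughly like this (the feature names here are just placeholders for the real few-dozen columns; since every feature shares data.index, a column-wise concat should be equivalent to chaining .join() calls):

# Sketch only -- the real code combines a few dozen features, all indexed
# by data.index, so concatenating column-wise matches the chained joins.
features = pandas.concat([feature1, feature2, feature3], axis=1)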
And I train a random forest classifier:
classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=n_jobs,
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])
The RandomForestClassifier works fine with these features; building a tree takes O(hundreds of megabytes) of memory. However: if, after loading my data, I take a small subset of it:
data_slice = data[data['somecolumn'] > value]
Then building a tree for my random forest suddenly takes many gigabytes of memory, even though the size of the features DataFrame is now O(10%) of the original.
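(The size figures above are rough; I'm sanity-checking them with something like the following, though I'm not sure memory_usage accounts for everything:)

# Rough footprint check -- deep=True so object/string columns are counted too.
print(features.memory_usage(deep=True).sum() / 1e6, "MB")
print(data.memory_usage(deep=True).sum() / 1e6, "MB")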
I can believe that this might be because a sliced view on the data doesn't permit further slices to be done efficiently (though I don't see how this could propagate into the features array), so I've tried:
data = pandas.DataFrame(data_slice, copy=True)
but this doesn't help.
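The only other idea I have is to rebuild the slice from scratch with a fresh, contiguous index, along the lines of the sketch below (reset_index and copy are standard pandas calls, but I haven't verified that this actually changes the memory behaviour):

# Untested idea: give the slice a brand-new RangeIndex and an explicit copy,
# so that nothing should still refer back to the original 300MB frame.
data = data[data['somecolumn'] > value].reset_index(drop=True).copy()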
- Why would taking a subset of the data massively increase memory use?
- Is there some other way to compact / rearrange a DataFrame which might make things more efficient again?