I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray which happens to be sparse. Calling fit gives ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data, so I expected the object to have a todense method, but it doesn't:
>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
with 141256894 stored elements in Compressed Sparse Row format>,
dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>
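If I read that repr correctly, the outer object is a zero-dimensional ndarray of dtype=object that merely wraps the real csr_matrix, so the matrix itself should be recoverable with .item() (a sketch, assuming the wrapper really holds just that one matrix):

import numpy as np
from scipy import sparse

# allow_pickle=True is needed on newer NumPy versions to load object arrays
X_wrapped = np.load('train.npy', allow_pickle=True)

# a 0-d object array holds exactly one Python object; .item() returns it
X_sparse = X_wrapped.item()
assert sparse.issparse(X_sparse)  # should now be the actual csr_matrix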
I tried wrapping it with a SciPy csr_matrix, but that gives errors as well.
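From memory, the attempt was roughly this (a reconstruction, so the exact call may have differed):

from scipy.sparse import csr_matrix

X_csr = csr_matrix(X_train)  # errors out: the 0-d object wrapper is not valid input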
Is there any way to make random forest accept this data? (I'm not sure a dense version would actually fit in memory anyway, but that's another matter...)
EDIT 1
The code generating the error is just this:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy')  # returns an ndarray, as shown above
train_gt = pd.read_csv('train_gt.csv')
model = RandomForestClassifier()
model.fit(X_train, train_gt.target)  # raises the ValueError
As for the suggestion to use toarray(): ndarray has no such method.
AttributeError: 'numpy.ndarray' object has no attribute 'toarray'
Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the dense array. Is there an option to run RandomForestClassifier with a sparse array?
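That said, the fit documentation indicates that RandomForestClassifier accepts a scipy sparse matrix directly (it is converted to a sparse CSC matrix internally), so unwrapping the object array as above might be all that's needed; a sketch, untested at this data size:

from sklearn.ensemble import RandomForestClassifier

X_sparse = X_train.item()  # unwrap the csr_matrix from the 0-d object array
model = RandomForestClassifier()
model.fit(X_sparse, train_gt.target)  # fit accepts scipy sparse input directly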
EDIT 2
It seems that the data should have been saved using SciPy's sparse module, as described in Save / load scipy sparse csr_matrix in portable data format. Saving the matrix object with NumPy's save/load pickles it into the 0-d object array seen above; to use NumPy's save/load properly, the matrix's component arrays (data, indices, indptr and the shape) would have had to be saved separately.
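For the record, the save/load should presumably have looked something like this (scipy.sparse.save_npz / load_npz require SciPy >= 0.19; older versions would need the manual data/indices/indptr approach from the linked post):

from scipy import sparse

# saving: keeps the CSR structure in a portable .npz file
sparse.save_npz('train_sparse.npz', X_sparse)  # X_sparse: the unwrapped csr_matrix

# loading: comes back as a real csr_matrix, no object-array wrapper
X_loaded = sparse.load_npz('train_sparse.npz')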