3

I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray but is actually sparse. Calling fit raises ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.

I expected the object to have a todense method, but it doesn't.

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>

I tried wrapping it with a SciPy csr_matrix but that gives errors as well.

Is there any way to make random forest accept this data? (not sure that dense would actually fit in memory, but that's another thing...)

EDIT 1

The code generating the error is just this:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy') # this returns an ndarray
train_gt = pd.read_csv('train_gt.csv')

model = RandomForestClassifier()
model.fit(X_train, train_gt.target)

As for the suggestion to use toarray(), ndarray has no such method: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the array. Is there an option to run RandomForestClassifier with a sparse array?

EDIT 2

It seems that the data should have been saved using SciPy's sparse module, as described in "Save / load scipy sparse csr_matrix in portable data format". With NumPy's save/load, more data would have had to be saved (the arrays that make up the matrix, not the matrix object itself).
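A minimal sketch of what saving "more data" with NumPy could look like, assuming the matrix is a csr_matrix: the component arrays are stored one by one and the matrix is rebuilt on load. The small matrix M and the file name are made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small matrix standing in for the real training data.
M = csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_parts.npz")  # illustrative file name

    # Store the arrays that define a CSR matrix, not the matrix object.
    np.savez(path, data=M.data, indices=M.indices, indptr=M.indptr, shape=M.shape)

    # Rebuild the matrix from its parts; no object array is involved.
    with np.load(path) as parts:
        restored = csr_matrix(
            (parts["data"], parts["indices"], parts["indptr"]),
            shape=tuple(parts["shape"]),
        )
```

Because only plain numeric arrays are written, the loaded result is a real sparse matrix and not an ndarray wrapping one.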

mibm
  • Please provide a minimal reproducible example so that we can see which code causes the error. Right now, you're only showing the type of `X_train`, but neither its shape nor how you are feeding it into the `RandomForestClassifier`. Likely, the data is not shaped correctly; see the answer to this related question: https://stackoverflow.com/questions/4674473/valueerror-setting-an-array-element-with-a-sequence – IonicSolutions Apr 11 '19 at 16:52

4 Answers

8
>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)

means that your code, or something it calls, has done np.array(M), where M is a csr sparse matrix. That just wraps the matrix in an object dtype array.

To use a sparse matrix in code that doesn't take sparse matrices, you have to first convert them to dense:

 arr = M.toarray()    # or M.A same thing
 mat = M.todense()    # to make a np.matrix

But given the dimensions and number of nonzero elements, it is likely that this conversion will produce a memory error.

hpaulj
  • The object I get is `ndarray` which does not have `toarray` or `todense`. I cannot see any method that would convert that to a csr_matrix – mibm Apr 14 '19 at 07:12
  • Use `X_train[()]` to take the wrongly saved matrix out of the array wrapper. Then use `toarray`. – hpaulj Apr 14 '19 at 11:26
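
A small sketch of the unwrapping step from the comment above. The wrapper is built by hand here as a 0-d object array, which is the shape and dtype shown in the question's repr; the tiny matrix M is a made-up stand-in for the real data.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Hypothetical small matrix standing in for the real training data.
M = csr_matrix(np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0]]))

# A 0-d object array holding the matrix, like the one in the question.
wrapped = np.empty((), dtype=object)
wrapped[()] = M

# Indexing with an empty tuple returns the stored object itself.
unwrapped = wrapped[()]

# .item() is an equivalent, more explicit way to get the object out.
also_unwrapped = wrapped.item()
```

After this, unwrapped is the original csr_matrix again, so sparse methods such as toarray are available on it.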
1

I believe you're looking for the toarray method, as shown in the documentation.

So you can do, e.g., X_dense = X_train.toarray().

Of course, then your computer crashes (unless you have the requisite 22 terabytes of RAM?).
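
A back-of-the-envelope check of that figure, using the shape and stored-element count from the question's repr and 8 bytes per float64. The CSR estimate assumes 32-bit index arrays, which is an assumption and not something the question states.

```python
# Numbers taken from the repr of X_train in the question.
rows, cols = 1_443_899, 1_936_774
nnz = 141_256_894
bytes_per_float64 = 8

# Dense storage: one float64 per cell.
dense_bytes = rows * cols * bytes_per_float64
dense_tb = dense_bytes / 1e12

# Fraction of cells that actually hold a stored value.
density = nnz / (rows * cols)

# Rough CSR footprint: values + column indices + row pointers
# (assuming 4-byte indices; 8-byte indices would raise this).
csr_bytes = nnz * bytes_per_float64 + nnz * 4 + (rows + 1) * 4
csr_gb = csr_bytes / 1e9
```

Under these assumptions the dense array needs about 22 TB, while the sparse form needs under 2 GB.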

Nathan
  • `ndarray` does not have a `toarray` method, otherwise I would not pose the question. And you are correct -- the array would require terabytes (I think "just" 2.2) which is not practical. – mibm Apr 14 '19 at 07:13
0

It seems that the data should have been saved using SciPy's sparse module, as described in "Save / load scipy sparse csr_matrix in portable data format". With NumPy's save/load, more data would have had to be saved (the arrays that make up the matrix, not the matrix object itself).

RandomForestClassifier can run using data in this format. The code has been running for 1:30h now, so hopefully it will actually finish :-)
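
A minimal sketch of that workflow, assuming "this format" means a SciPy sparse matrix written with scipy.sparse.save_npz and read back with scipy.sparse.load_npz (the answer does not say which function was used). The random data, file name, and model parameters are made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: a mostly-zero feature matrix and binary labels.
rng = np.random.default_rng(0)
dense = rng.random((200, 50))
dense[dense < 0.95] = 0.0
X = sparse.csr_matrix(dense)
y = rng.integers(0, 2, size=200)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npz")  # illustrative file name
    sparse.save_npz(path, X)           # save as a sparse matrix
    X_loaded = sparse.load_npz(path)   # comes back as a sparse matrix

# The forest is fitted on the sparse matrix directly, with no dense copy
# made by this code.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_loaded, y)
predictions = model.predict(X_loaded)
```

The point of the sketch is that nothing is converted with toarray before fitting, which is what makes the approach usable when a dense copy would not fit in memory.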

mibm
0

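Since you've loaded a csr matrix using np.load, you need to convert it from a NumPy array back to a csr matrix. You said you tried wrapping it with csr_matrix, but the array is not the matrix itself, it only contains it. You need to call .all() to get the contents out:

from scipy.sparse import csr_matrix

temp = csr_matrix(X_train.all())  # relies on .all() returning the wrapped object; X_train.item() is the explicit way
X_train = temp.toarray()  # dense copy; only feasible if it fits in memory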