
I have a very sparse dataset with a huge number of attributes (~12K features and 700K records) that I cannot fit in memory. The attribute values are binomial, i.e. True/False.

As it is sparse, I keep the dataset in (ID, Feature) format, so for example I would have the following records:
(ID, Feature)
(110, d_0022)
(110, d_2393)
(110, i_2293)
(822, d_933)
(822, p_2003)
....

So the record with ID 110 would have three attributes set to true (d_0022, d_2393, i_2293) and all the rest set to false (the attributes are all the distinct values of the "feature" column).
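For concreteness, this is the mapping I have in mind, sketched in Java (the class and variable names are just illustrative, not taken from any particular tool): grouping the pairs by ID gives each record a set of true features, and any feature absent from the set is implicitly false.

    import java.util.*;

    public class PairFormat {
        public static void main(String[] args) {
            // The (ID, Feature) pairs from the example above; in practice
            // they would be streamed from disk, not held in a literal array.
            String[][] pairs = {
                {"110", "d_0022"}, {"110", "d_2393"}, {"110", "i_2293"},
                {"822", "d_933"},  {"822", "p_2003"},
            };

            // Group features by record ID: each record stores only its
            // true attributes; any feature not in the set is false.
            Map<String, Set<String>> records = new HashMap<>();
            for (String[] p : pairs) {
                records.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
            }

            System.out.println(records.get("110")); // [d_0022, d_2393, i_2293]
        }
    }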

Is there any software available that implements an algorithm able to train on this kind of dataset directly, so that I don't have to materialize the WHOLE dense dataset first?

(Currently I am using RapidMiner.)

Arian

1 Answer


You can use R's sparse matrices (example) or Weka with SparseInstance (or even BinarySparseInstance). If the sparse matrix still doesn't fit in memory, you can use Mahout and a small cluster on Amazon EC2 to run SVD, reducing the dimensionality of your matrix so that it becomes manageable for normal processing.
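For illustration, here is a minimal sketch of the Weka route in Java (assuming the Weka 3.7+ ArrayList-based API; the attribute names come from the question's example). A SparseInstance stores only its non-zero entries, so each record costs memory proportional to its (ID, Feature) pairs rather than to the full ~12K columns.

    import java.util.ArrayList;

    import weka.core.Attribute;
    import weka.core.Instances;
    import weka.core.SparseInstance;

    public class SparseDemo {
        public static void main(String[] args) {
            // One numeric 0/1 attribute per distinct value of "feature";
            // only three are shown, the real dictionary would have ~12K.
            ArrayList<Attribute> attrs = new ArrayList<>();
            attrs.add(new Attribute("d_0022"));
            attrs.add(new Attribute("d_2393"));
            attrs.add(new Attribute("i_2293"));

            Instances data = new Instances("sparse_demo", attrs, 0);

            // Built from an all-zero dense row, the instance stores nothing;
            // setValue() then records only the true (non-zero) entries.
            SparseInstance row = new SparseInstance(1.0, new double[attrs.size()]);
            row.setValue(0, 1.0); // d_0022 = true
            row.setValue(1, 1.0); // d_2393 = true
            row.setValue(2, 1.0); // i_2293 = true
            data.add(row);

            System.out.println(data);
        }
    }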

I have almost no experience with RapidMiner, but it may well have some implementation of sparse matrices too.

ffriend