
I'm trying to train using the xgboost algorithm. This algorithm requires that the data be numerical, and I believe even more specifically, of class dgCMatrix (I could be wrong on this last point).

I have data stored in a data frame that is categorical, that is, each column of the data set contains discrete strings such as 'class1', 'class2', etc, that are not ordered. Currently, I am using this:

sparseMatrix <- sparse.model.matrix(someColName ~.-1, data = myDataFrame)

(Tbh, I'm not really sure how the "someColName ~.-1" works but it appears to delete a column, so disregard that portion of the code if it doesn't make sense.)
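For reference, here is a small sketch of what that formula does, assuming the function is `sparse.model.matrix` from the Matrix package. In R's formula syntax, the left-hand side (`someColName`) is the response and is left out of the design matrix, `.` means "all remaining columns", and `-1` drops the intercept column. The `toy` data frame is a hypothetical stand-in for `myDataFrame`.

```r
library(Matrix)

# Hypothetical toy data standing in for myDataFrame
toy <- data.frame(
  someColName = c("a", "b", "a"),
  x1 = c("class1", "class2", "class1"),
  x2 = c("u", "v", "v"),
  stringsAsFactors = TRUE
)

# someColName is the response, so it does not appear among the columns;
# "." expands to x1 and x2; "-1" removes the intercept column
m <- sparse.model.matrix(someColName ~ . - 1, data = toy)

class(m)     # sparse column-compressed matrix class from Matrix
colnames(m)  # dummy columns built from x1 and x2 only
```

So the column that "disappears" is the response, not a predictor that was deleted by accident.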

When I type str(myDataFrame), all of the data is stored as character data and not as a factor.

My problem is that the line I posted takes a really long time. So my question is: what is the fastest way to convert this data into numerical data/dgCMatrix so that it will be compatible with xgboost? I don't believe it necessarily NEEDS to be a sparse matrix; it's just that as a sparse matrix, it is numerical. Would it be more efficient to convert each column into a factor first, then convert to a sparse matrix? Any help is appreciated. Thank you!
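To make the factor-first idea concrete, here is a minimal sketch that skips the formula machinery and builds the indicator columns directly with `Matrix::sparseMatrix`. The helper name `one_hot_sparse` and the `toy` data are hypothetical, and this has not been benchmarked against the original call, so whether it is faster on a given data set is an assumption to verify. Note that it produces one column per level of every variable (full one-hot), whereas a model-matrix formula drops a reference level for all but the first factor.

```r
library(Matrix)

# Hypothetical helper: one-hot encode every column of a character data frame
one_hot_sparse <- function(df) {
  n <- nrow(df)
  blocks <- lapply(names(df), function(nm) {
    f <- factor(df[[nm]])   # character -> factor, one column at a time
    keep <- !is.na(f)       # rows with NA get no indicator in this block
    sparseMatrix(
      i = which(keep),             # row index of each non-zero entry
      j = as.integer(f)[keep],     # column index = level code
      x = rep(1, sum(keep)),       # numeric 1s give a dgCMatrix
      dims = c(n, nlevels(f)),
      dimnames = list(NULL, paste0(nm, levels(f)))
    )
  })
  do.call(cbind, blocks)    # bind the per-column blocks side by side
}

# Hypothetical toy data standing in for the predictor columns
toy <- data.frame(
  x1 = c("class1", "class2", "class1"),
  x2 = c("u", "v", "v"),
  stringsAsFactors = FALSE
)

X <- one_hot_sparse(toy)
```

The result `X` is a `dgCMatrix`, which is one of the input types xgboost accepts. The response column would need to be removed from the data frame before calling the helper and passed to xgboost separately as the label.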

Isaac T
  • flodel provides a quick way here -> https://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r – user20650 Jun 06 '18 at 17:08
