I'm trying to train a model with xgboost. The algorithm requires numerical input, and I believe it specifically needs an object of class dgCMatrix (I could be wrong on this last point).
I have categorical data stored in a data frame: each column contains discrete, unordered strings such as 'class1', 'class2', etc. Currently, I am using this:
sparseMatrix <- sparse.model.matrix(someColName ~.-1, data = myDataFrame)
(To be honest, I'm not really sure how the "someColName ~.-1" part works, but it appears to delete a column, so disregard that portion of the code if it doesn't make sense.)
When I run str(myDataFrame), every column is stored as character data, not as a factor.
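To make this reproducible, here is a minimal sketch of my setup. The data below is made up purely for illustration (the column names col1 and col2 and all values are hypothetical); my real data frame is much larger, which is where the slowdown shows up. I'm assuming sparse.model.matrix comes from the Matrix package.

```r
library(Matrix)

# Hypothetical toy data standing in for myDataFrame; every column is character
myDataFrame <- data.frame(
  someColName = c("a", "b", "a", "b"),
  col1 = c("class1", "class2", "class3", "class1"),
  col2 = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Shows that each column is chr, not Factor
str(myDataFrame)

# The call that is slow on my real data
sparseMatrix <- sparse.model.matrix(someColName ~ . - 1, data = myDataFrame)

class(sparseMatrix)
dim(sparseMatrix)
```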
My problem is that the line I posted takes a really long time to run. So my question is: what is the fastest way to convert this data into numerical data/dgCMatrix so that it is compatible with xgboost? I don't believe it necessarily NEEDS to be a sparse matrix; it's just that as a sparse matrix, it is numerical. Would it be more efficient to convert each column into a factor first, then convert to a sparse matrix? Any help is appreciated. Thank you!
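For concreteness, the factor-first variant I have in mind would look roughly like the sketch below. Again, the data is made up (col1, col2 and the values are hypothetical), and I have not measured whether this is actually any faster on real data; it only shows the shape of the idea.

```r
library(Matrix)

# Hypothetical toy data standing in for myDataFrame; every column is character
myDataFrame <- data.frame(
  someColName = c("a", "b", "a", "b"),
  col1 = c("class1", "class2", "class3", "class1"),
  col2 = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Convert every column to a factor up front; the [] keeps the data frame structure
myDataFrame[] <- lapply(myDataFrame, factor)

# Same call as before, now operating on factor columns
sparseMatrix <- sparse.model.matrix(someColName ~ . - 1, data = myDataFrame)

class(sparseMatrix)
dim(sparseMatrix)
```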