
This is really a two-part question, but I didn't want to make the title too long.

I'm running xgboost on some categorical data, which I first have to convert to a sparse matrix. Originally I used the Matrix package's "sparse.model.matrix()" but found it too slow. After asking on here how to do the conversion efficiently, I was directed to flodel's answer.

However, I noticed that for my data the two methods produce different numbers of columns. flodel's answer makes more sense to me: the number of columns it generates exactly matches the total number of unique categories across all the columns of the original data set, which is what you would expect from a true one-hot encoding. I verified that total by iterating over the columns (which are factors) of the unconverted data frame and summing nlevels(df$column). The call to sparse.model.matrix() produces fewer columns than that, so my first question is:
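To illustrate with toy data (not my real set), here is roughly the kind of direct sparseMatrix construction I mean; the exact shape of flodel's code may differ:

```r
library(Matrix)

# Toy data frame of factors (stand-in for my real data)
df <- data.frame(a = factor(c("x", "y", "z")),
                 b = factor(c("p", "q", "p")))

# One column per level across all factors: 3 + 2 = 5
n.levels <- sapply(df, nlevels)
sum(n.levels)

# Column offsets so each factor occupies its own block of columns
offsets <- c(0, cumsum(n.levels))[seq_along(df)]
j <- unlist(Map(function(col, off) as.integer(col) + off, df, offsets))

# One 1 per row per factor, at the column for that row's level
m <- sparseMatrix(i = rep(seq_len(nrow(df)), length(df)),
                  j = j, x = 1,
                  dims = c(nrow(df), sum(n.levels)))
dim(m)  # 3 rows, 5 columns
```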

1) Why does sparse.model.matrix() produce a one-hot encoding with fewer than the expected number of columns for a true one-hot encoding?
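A minimal reproduction of the discrepancy on toy data (not my real set):

```r
library(Matrix)

df <- data.frame(a = factor(c("x", "y", "z")),
                 b = factor(c("p", "q", "p")))

# 5 unique categories in total across the two factor columns
sum(sapply(df, nlevels))

# sparse.model.matrix gives fewer columns than that:
m <- sparse.model.matrix(~ ., data = df)
ncol(m)      # 4: an "(Intercept)" column plus 2 + 1 dummy columns
colnames(m)  # one level of each factor is missing
```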

After training xgboost on both sparse matrices, I get different results when I ask which variables are most influential (as determined by a call to xgb.importance()). Thus:

2) Which model should I trust: the one trained on the sparse matrix from sparse.model.matrix(), or the one trained via flodel's method (which makes more sense to me)?

Thank you.

Isaac T
    (1) You typically have (num.of.groups-1) columns when modelling a categorical variable because you have a reference category. Comparisons have to be relative to something. – thelatemail Jun 06 '18 at 23:16
    you can force model.matrix to include all levels: (see https://stackoverflow.com/questions/4560459/all-levels-of-a-factor-in-a-model-matrix-in-r/4569239#4569239) - you need to decide which is appropriate – user20650 Jun 07 '18 at 00:38
  • Hi user20650, that was helpful, unfortunately it is still too slow. I'm wondering if there is a way to do the opposite, that is, adapt flodel's answer to return identical output of sparse.model.matrix? – Isaac T Jun 11 '18 at 16:59
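For reference, the approach from the answer linked in the comments can be sketched like this on toy data (it keeps every level, though, per my comment above, it was still too slow on my data):

```r
library(Matrix)

df <- data.frame(a = factor(c("x", "y", "z")),
                 b = factor(c("p", "q", "p")))

# contrasts = FALSE requests full indicator coding for every factor,
# so no reference level is dropped; "~ . - 1" also removes the intercept.
m.full <- sparse.model.matrix(
  ~ . - 1, data = df,
  contrasts.arg = lapply(df, contrasts, contrasts = FALSE))
ncol(m.full)  # 5: one column per level
```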

0 Answers