How to train a lasso with both text and numeric variables?

Question

Consider this modified classic example:

library(dplyr)
library(tibble)

# tibble() replaces the deprecated data_frame()
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "France",
                          "Tokyo Japan Chinese"),
                 add_numeric = c(1, 1, 0, 1),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))


> dtrain
# A tibble: 4 x 4
  text                     add_numeric doc_id class
  <chr>                          <dbl>  <int> <dbl>
1 Chinese Beijing Chinese            1      1     1
2 Chinese Chinese Shanghai           1      2     1
3 France                             0      3     1
4 Tokyo Japan Chinese                1      4     0

Here, I would like to use lasso to predict class. The variables of interest are text and add_numeric.

I know how to use text2vec or tm to predict class using text only: the packages transform text into a sparse document-term matrix and feed it to the model.

However, here I want to use both the textual variable text and the numeric variable add_numeric. I do not know how to combine the two approaches. Any ideas? Thanks!

Doesn't training your lasso model on `add_numeric` and the dummy variables obtained from `text` solve your problem? PS: Since `class` is categorical, I would advise you to train a logistic lasso model instead of a plain lasso. – Neroksi, Jun 16 '18 at 14:34
Yes, that is the idea, but how to do that in practice? My text has thousands of different words, so I need to use a dtm from `text2vec` or similar. However, I cannot easily add my numeric variable to the `dtm` in these packages (they are only meant for text). – ℕʘʘḆḽḘ, Jun 16 '18 at 14:58
So your question is **how to get dummies from a factor variable**? This question has already been answered [here](https://stackoverflow.com/questions/11952706/generate-a-dummy-variable). It's worth taking a look at all the methods offered by R. – Neroksi, Jun 16 '18 at 18:03

score 1 · Accepted Answer · answered Jun 16 '18 at 15:32

I haven't checked how to do this with text2vec, but with quanteda it is quite easy: just use cbind. The advantage is that the result stays a sparse matrix. I haven't changed the dimnames, so the added column shows up as feat1.

library(quanteda)

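# create the document-term matrix; tokens() is called first because
# newer quanteda versions expect dfm() to receive a tokens object
dtm <- dfm(tokens(dtrain$text))
dtm_num <- cbind(dtm, dtrain$add_numeric) # add column to sparse matrix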
dtm_num
Document-feature matrix of: 4 documents, 7 features (60.7% sparse).
4 x 7 sparse Matrix of class "dfm"
       features
docs    chinese beijing shanghai france tokyo japan feat1
  text1       2       1        0      0     0     0     1
  text2       2       0        1      0     0     0     1
  text3       0       0        0      1     0     0     0
  text4       1       0        0      0     1     1     1
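To follow up on the logistic lasso suggestion from the comments, here is a minimal sketch of naming the added column and fitting the model. Two assumptions of mine, not part of the original answer: the numeric variable is wrapped in a one-column sparse matrix so it carries the name add_numeric instead of feat1, and the four toy rows are stacked ten times. The stacking is purely artificial; as far as I know glmnet refuses family = "binomial" when a class has only a single observation, which is the case for class 0 in the toy data. The names base, dtrain_big, num, x and coefs are illustrative only.

```r
library(quanteda)
library(Matrix)
library(glmnet)

# Toy data from the question
base <- data.frame(
  text = c("Chinese Beijing Chinese", "Chinese Chinese Shanghai",
           "France", "Tokyo Japan Chinese"),
  add_numeric = c(1, 1, 0, 1),
  class = c(1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# ASSUMPTION: stack the rows 10 times, only so each class has enough
# observations for a logistic fit. Real data would not need this.
dtrain_big <- base[rep(seq_len(nrow(base)), times = 10), ]

# Sparse document-term matrix from the text column
dtm <- dfm(tokens(dtrain_big$text))

# One-column sparse matrix that carries its own column name
num <- Matrix(dtrain_big$add_numeric, ncol = 1, sparse = TRUE,
              dimnames = list(NULL, "add_numeric"))

# Combine text features and the numeric feature; the result stays sparse
x <- cbind(as(dtm, "dgCMatrix"), num)

# alpha = 1 selects the lasso penalty; family = "binomial" makes it logistic
fit <- glmnet(x, dtrain_big$class, family = "binomial", alpha = 1)

# Coefficients at one penalty value (s = 0.05 is an arbitrary illustration)
coefs <- coef(fit, s = 0.05)
print(coefs)
```

On real data the penalty would normally be chosen by cross-validation, for example with cv.glmnet, rather than fixed by hand.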

phiver

very effective! – ℕʘʘḆḽḘ Jun 16 '18 at 15:38
can we feed this to lasso? – ℕʘʘḆḽḘ Jun 16 '18 at 15:38
Yep, directly: `glmnet(dtm_num, y = dtrain$class)`. The glmnet documentation says that x "can be in sparse matrix format". – phiver Jun 16 '18 at 15:45
Hello phiver! `cbind` does not work anymore today... Any ideas? Thanks – ℕʘʘḆḽḘ Feb 26 '21 at 21:47
@ℕʘʘḆḽḘ, what do you mean it doesn't work anymore? If I use the example above, everything works fine (quanteda 2.1.2 and R 4.0.2). – phiver Feb 27 '21 at 12:06
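
If cbind on the text package's object misbehaves in a given version, one fallback that does not depend on any text package is to build the sparse document-term matrix directly with the Matrix package and bind the numeric column to that. This is only a sketch: the whitespace tokenizer is deliberately naive, and the names toks, vocab, dtm and x_all are mine, not from the thread.

```r
library(Matrix)

text <- c("Chinese Beijing Chinese", "Chinese Chinese Shanghai",
          "France", "Tokyo Japan Chinese")
add_numeric <- c(1, 1, 0, 1)

# Lower-case and split on whitespace (a deliberately naive tokenizer)
toks <- strsplit(tolower(text), "\\s+")
vocab <- unique(unlist(toks))

# Row index = document, column index = position of the term in the vocabulary
i <- rep(seq_along(toks), lengths(toks))
j <- match(unlist(toks), vocab)

# sparseMatrix() sums repeated (i, j) pairs, which gives term counts
dtm <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                    dims = c(length(text), length(vocab)),
                    dimnames = list(NULL, vocab))

# Numeric variable as a named one-column sparse matrix
num <- Matrix(add_numeric, ncol = 1, sparse = TRUE,
              dimnames = list(NULL, "add_numeric"))

# Text features plus the numeric feature, still sparse
x_all <- cbind(dtm, num)
print(x_all)
```

The resulting x_all is an ordinary sparse matrix from the Matrix package, which is the format glmnet accepts for x.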
