Consider this modified classic example:

library(dplyr)
library(tibble)

# tibble() replaces the deprecated data_frame()
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "France",
                          "Tokyo Japan Chinese"),
                 add_numeric = c(1, 1, 0, 1),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))


> dtrain
# A tibble: 4 x 4
  text                     add_numeric doc_id class
  <chr>                          <dbl>  <int> <dbl>
1 Chinese Beijing Chinese            1      1     1
2 Chinese Chinese Shanghai           1      2     1
3 France                             0      3     1
4 Tokyo Japan Chinese                1      4     0

Here, I would like to use lasso to predict class. The variables of interest are text and add_numeric.

I know how to use text2vec or tm to predict class using text only: the packages will transform text into a sparse document term matrix and feed the model.

However, here, I want to use both a textual variable text, and add_numeric. I do not know how to mix the two approaches. Any ideas? Thanks!

ℕʘʘḆḽḘ
  • Wouldn't training your `Lasso` model on `add_numeric` and the dummy variables obtained from `text` solve your problem? PS: Since `class` is categorical, I would advise you to train a `logistic lasso` model instead of a plain `lasso`. – Neroksi Jun 16 '18 at 14:34
  • Yes, that is the idea, but how do I do that in practice? My text has thousands of different words, so I need to use a dtm from `text2vec` or a similar package. However, I cannot easily add my numeric variable to the `dtm` in these packages (they are only meant for text). – ℕʘʘḆḽḘ Jun 16 '18 at 14:58
  • So your question is **how to get dummies from a factor variable**? This question has already been answered [here](https://stackoverflow.com/questions/11952706/generate-a-dummy-variable). It's worth taking a look at all the methods R offers. – Neroksi Jun 16 '18 at 18:03

1 Answer

I haven't checked how to do this with text2vec, but with quanteda it is quite easy: just use cbind, with the advantage that the result stays a sparse matrix. I haven't changed the dimnames, so the added column is shown as feat1.

library(quanteda)

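# create the document-term matrix; quanteda >= 3 expects tokens() first,
# older versions also accepted dfm(dtrain$text) directly
dtm <- dfm(tokens(dtrain$text))
dtm_num <- cbind(dtm, dtrain$add_numeric) # add the numeric column to the sparse matrix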
dtm_num
Document-feature matrix of: 4 documents, 7 features (60.7% sparse).
4 x 7 sparse Matrix of class "dfm"
       features
docs    chinese beijing shanghai france tokyo japan feat1
  text1       2       1        0      0     0     0     1
  text2       2       0        1      0     0     0     1
  text3       0       0        0      1     0     0     0
  text4       1       0        0      0     1     1     1
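
For text2vec, the same idea might look roughly like the sketch below. It is not taken from the answer above: it assumes the text2vec, Matrix and glmnet packages are installed, and it stacks the toy data three times purely so that each class has enough rows for a binomial fit. The names `base`, `num` and `x` are made up for illustration.

```r
library(text2vec)
library(Matrix)
library(glmnet)

# Toy data from the question, stacked three times (an assumption made only
# so that each class has enough rows for glmnet to fit a binomial model).
base <- data.frame(
  text = c("Chinese Beijing Chinese", "Chinese Chinese Shanghai",
           "France", "Tokyo Japan Chinese"),
  add_numeric = c(1, 1, 0, 1),
  class = c(1, 1, 1, 0),
  stringsAsFactors = FALSE
)
dtrain <- base[rep(seq_len(nrow(base)), times = 3), ]
dtrain$doc_id <- seq_len(nrow(dtrain))

# Sparse document-term matrix built from the text column.
it <- itoken(dtrain$text, preprocessor = tolower, tokenizer = word_tokenizer,
             ids = dtrain$doc_id, progressbar = FALSE)
dtm <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

# Wrap the numeric variable as a named one-column sparse matrix and bind it on.
num <- Matrix(dtrain$add_numeric, ncol = 1, sparse = TRUE,
              dimnames = list(NULL, "add_numeric"))
x <- cbind(dtm, num)

# alpha = 1 selects the lasso penalty; family = "binomial" makes it a
# logistic lasso, as suggested in the comments.
fit <- glmnet(x, dtrain$class, family = "binomial", alpha = 1)
```

On data this small glmnet is likely to warn about the tiny class sizes; with a real corpus you would normally pick the penalty with `cv.glmnet()` rather than inspect the whole path.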
phiver