
I see lots of functionality in the text2vec package for tokenizing strings and building DTMs, but is there a way to create sequences? The RStudio keras library has this, but it is incredibly slow. The idea is that instead of returning a matrix, you return a list of vectors of tokenized items, one vector per element of the input vector.

https://keras.rstudio.com/reference/texts_to_sequences.html

This feels like something that should be obvious but I can't seem to find it.

Jacqueline Nolis
  • Agreed, this should be obvious, but there just isn't a very robust implementation of word/document vectorization in R that I know of. I tried to use that package, but it just didn't make much sense to me. With `sparklyr` you can use the built-in word2vec models in Spark, but that requires you to use Spark, which is a whole other learning challenge. I recently did all my text prep in R using the `tidytext` package, but used Python and the gensim library to do the doc2vec. It was very straightforward. – Ben G Mar 11 '19 at 18:54
  • Does splitting the sparse matrix provide a way forward? If `m` is your `dtm`: `dp <- diff(m@p) ; cc = rep(seq_along(dp),dp) ;s = split(cc, m@i)`. This produces a list of sequences, where the values are the column indices for each row. – user20650 Apr 02 '19 at 11:29
  • Unfortunately @user20650 it's really important to preserve the ordering of the words for our case. – Jacqueline Nolis Apr 02 '19 at 17:29
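
For reference, here is a minimal sketch of one order-preserving approach, not a confirmed text2vec feature for sequences. It assumes the tokenizer and vocabulary helpers from text2vec (`word_tokenizer`, `itoken`, `create_vocabulary`) and does the index lookup with base R `match`. The sample `texts`, the lowercasing step, and the choice to number terms by descending frequency (to mimic the keras convention) are all assumptions made for illustration.

```r
library(text2vec)

# Hypothetical sample input, one string per document.
texts <- c("the cat sat on the mat", "the dog sat")

# Tokenize each document; the result is a list of character vectors
# with the tokens in their original order.
tokens <- word_tokenizer(tolower(texts))

# Build a vocabulary (a data frame with term and term_count columns).
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)

# Assumption: index terms by descending frequency, so 1 is the most
# frequent term, similar to what keras does.
terms <- vocab$term[order(-vocab$term_count)]

# Map every token to its integer index, keeping word order per document.
# Tokens missing from `terms` would come back as NA.
sequences <- lapply(tokens, match, table = terms)
```

Because the lookup runs over the token list rather than over the DTM, the word order within each document is kept. If the vocabulary is pruned first, the resulting `NA` entries would need to be dropped or mapped to a reserved index.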

0 Answers