Questions tagged [text2vec]

text2vec is an R package that provides a fast and memory-efficient framework for text mining applications: vectorization, word embeddings, topic modelling, and more.

The goal of text2vec is to provide tools that make text mining in R easy, at C++ speeds:

  1. Core parts written in C++
  2. Small memory footprint
  3. Concise, pipe-friendly API
  4. No need to load all data into RAM; process it in chunks
  5. Easy vertical scaling with multiple cores and threads

See the development page on GitHub.

111 questions

16 votes, 2 answers

Really fast word ngram vectorization in R

edit: The new package text2vec is excellent, and solves this problem (and many others) really well. Links: text2vec on CRAN; text2vec on GitHub; vignette that illustrates ngram tokenization. I have a pretty large text dataset in R, which I've imported as a…
Zach

8 votes, 1 answer

Stemming function for text2vec

I am using text2vec in R and having difficulty writing a stemming function that works with the itoken function in the text2vec package. The text2vec documentation suggests this stemming function: stem_tokenizer1 =function(x) { word_tokenizer(x)…
rreedd

7 votes, 2 answers

Is there any reason to (not) L2-normalize vectors before using cosine similarity?

I was reading the paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say: Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e.…
user3554004

5 votes, 2 answers

How do I build a model using GloVe word embeddings and predict on test data using text2vec in R?

I am building a model that classifies text data into two categories (i.e., each comment is assigned to one of 2 categories) using GloVe word embeddings. I have two columns, one with textual data (comments) and the other one is a binary Target…

5 votes, 3 answers

Predicting next word with text2vec in R

I am building a language model in R to predict the next word in a sentence based on the previous words. Currently my model is a simple ngram model with Kneser-Ney smoothing. It predicts the next word by finding the ngram with maximum probability (frequency)…
Sasha

4 votes, 1 answer

From word vector to document vector [text2vec]

I'd like to use the GloVe word embedding implemented in text2vec to perform supervised regression/classification. I read the helpful tutorial on the text2vec homepage on how to generate the word vectors. However, I'm having trouble grasping how to…
D. K.

4 votes, 1 answer

Text2Vec classification with caret problems

Some context: "Working with text classification and big sparse matrices in R". I have been working on a text multi-class classification problem with the text2vec package and caret. The plan is to use text2vec for building the document-term matrix,…
Ed.

3 votes, 0 answers

Support for large sparse matrices R

Is there any support for large sparse matrices in R? I'm currently dealing with a 1.9M sparse square matrix with about 0.001 density. I wanted to stress test the creation of this matrix in R on my AWS spot instance with 480gb…
Olivier

3 votes, 1 answer

Convert DocumentTermMatrix to dgTMatrix

I'm trying to run the AssociatedPress dataset from the tm-package through text2vec's LDA implementation. The problem I'm facing is the incompatibility of data types: AssociatedPress is a tm::DocumentTermMatrix which in turn is a subclass of…
Oliver Baumann

3 votes, 1 answer

Can text2vec and topicmodels generate similar topics with suitable parameter settings for LDA?

I was wondering how the results of different packages, and hence algorithms, differ and whether parameters could be set in a way to produce similar topics. I had a look at the packages text2vec and topicmodels in particular. I used the code below to compare 10…
Manuel Bickel

3 votes, 2 answers

How to resolve R Error using text2vec glove function: unused argument (grain_size = 100000)?

Trying to work through the text2vec vignette in the documentation and here to create word embeddings for some tweets: head(twtdf$Tweet.content) [1] "$NFLX $GS $INTC $YHOO $LVS\n$MSFT $HOG $QCOM $LUV $UAL\n$MLNX $UA $BIIB $GOOGL $GM $V\n$SKX $GE $CAT…
xq1515426

3 votes, 1 answer

Error in UseMethod("itoken")

I have a dataframe IRC_DF and I would like to create an iterator over input objects to vocabularies. For this, I tried the following: it_train <- itoken(IRC_DF$Raison.Reco, preprocessor = prep_fun, tokenizer = tok_fun, ids =…
Datackatlon

3 votes, 1 answer

How to align two GloVe models in text2vec?

Let's say I have trained two separate GloVe vector space models (using text2vec in R) based on two different corpora. There could be different reasons for doing so: the two base corpora may come from two different time periods, or two very different…
user3554004

2 votes, 0 answers

R package text2vec--tokenize to sequences

I see lots of functionality in the text2vec package to tokenize strings and make DTMs, but is there a way to create sequences? The Rstudio keras library has this, but it is incredibly slow. The idea being that instead of returning a matrix, you…
Jacqueline Nolis

2 votes, 0 answers

Beginner advice about adding start/end sentence markers: using Quanteda functionalities versus doing it manually (custom code)

I need to add begin and end sentence markers to some texts that I analyze using Quanteda. I would like to add these markers using Quanteda but I do not see an explicit way to do that "out of the box". Searching for an answer I found a different…
user778806