Questions tagged [text2vec]

text2vec is an R package that provides a fast and memory-efficient framework for text mining applications: vectorization, word embeddings, topic modelling and more.

The goal of text2vec is to provide tools to easily perform text mining in R at C++ speeds:

Core parts written in C++
Small memory footprint
Concise, pipe friendly API
No need to load all data into RAM: process it in chunks
Easy vertical scaling across multiple cores and threads

See the development page on GitHub.
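
As a rough illustration of the vectorization workflow the description refers to, here is a minimal sketch that builds a document-term matrix with unigrams and bigrams. The toy corpus `docs` is made up for the example, and the calls assume a reasonably recent text2vec release; treat it as a starting point rather than a reference implementation.

```r
library(text2vec)

# Hypothetical toy corpus; any character vector of documents works here.
docs <- c(
  "text mining in R can be fast",
  "text2vec makes text mining fast",
  "word embeddings and topic models"
)

# itoken() creates an iterator that lowercases and tokenizes the documents.
it <- itoken(docs,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             progressbar = FALSE)

# Collect unigrams and bigrams into a vocabulary.
vocab <- create_vocabulary(it, ngram = c(1L, 2L))

# Map the vocabulary to columns and build a sparse document-term matrix.
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

dim(dtm)  # one row per document, one column per vocabulary term
```

The same iterator is passed to both `create_vocabulary()` and `create_dtm()`, so the raw text is tokenized on each pass instead of being held as a list of tokens in memory.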

111 questions

votes

2 answers

Really fast word ngram vectorization in R

edit: The new package text2vec is excellent, and solves this problem (and many others) really well. Links: text2vec on CRAN; text2vec on github; vignette that illustrates ngram tokenization. I have a pretty large text dataset in R, which I've imported as a…

asked Jul 22 '15 at 17:50

Zach

votes

1 answer

Stemming function for text2vec

I am using text2vec in R and having difficulty writing a stemming function that works with the itoken function in the text2vec package. The text2vec documentation suggests this stemming function: stem_tokenizer1 =function(x) { word_tokenizer(x)…

r text2vec

asked Nov 21 '16 at 11:13

rreedd

votes

2 answers

Is there any reason to (not) L2-normalize vectors before using cosine similarity?

I was reading the paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say: Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e.…

normalization cosine-similarity text2vec vector-space

asked Jul 11 '18 at 17:10

user3554004
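
For readers skimming this entry, one fact behind the question can be checked directly: cosine similarity divides by the vector norms, so L2-normalizing the inputs first does not change the cosine value itself. The sketch below uses base R only; the helper names `cosine` and `l2_normalize` and the example vectors are made up for illustration. It does not cover other operations (for example averaging vectors or training), where normalization can still matter.

```r
# Cosine similarity: dot product divided by the product of the L2 norms.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Scale a vector to unit L2 norm.
l2_normalize <- function(v) v / sqrt(sum(v^2))

# Hypothetical example vectors.
x <- c(3, 4, 0)
y <- c(1, 2, 2)

raw <- cosine(x, y)
normed <- cosine(l2_normalize(x), l2_normalize(y))

# For unit vectors the cosine reduces to a plain dot product.
dot_of_normed <- sum(l2_normalize(x) * l2_normalize(y))

all.equal(raw, normed)         # the two cosines agree up to floating point
all.equal(raw, dot_of_normed)  # and both equal the dot product of the unit vectors
```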

votes

2 answers

How do I build a model using GloVe word embeddings and predict on test data using text2vec in R?

I am building a classification model on text data into two categories (i.e. classifying each comment into 2 categories) using GloVe word embeddings. I have two columns, one with textual data (comments) and the other one is a binary Target…

r word2vec text-classification word-embedding text2vec

asked Mar 05 '18 at 22:23

sri sivani charan

votes

3 answers

Predicting next word with text2vec in R

I am building a language model in R to predict a next word in the sentence based on the previous words. Currently my model is a simple ngram model with Kneser-Ney smoothing. It predicts next word by finding ngram with maximum probability (frequency)…

r nlp n-gram text2vec

asked Apr 21 '16 at 21:06

Sasha

votes

1 answer

From word vector to document vector [text2vec]

I'd like to use the GloVe word embedding implemented in text2vec to perform supervised regression/classification. I read the helpful tutorial on the text2vec homepage on how to generate the word vectors. However, I'm having trouble grasping how to…

r text2vec

asked Dec 03 '17 at 06:11

D. K.

votes

1 answer

Text2Vec classification with caret problems

Some context: "Working with text classification and big sparse matrices in R". I have been working on a text multi-class classification problem with the text2vec package and caret. The plan is to use text2vec for building the document-term matrix,…

r svm r-caret text-classification text2vec

asked Aug 04 '16 at 13:19

Ed.

votes

0 answers

Support for large sparse matrices R

Is there any support for large sparse matrices in R? I'm currently dealing with a 1.9M sparse square matrix with about 0.001 density. I wanted to stress test the creating of this matrix in R on my AWS spot instance with 480gb…

r sparse-matrix reticulate text2vec

asked May 07 '20 at 10:44

Olivier

votes

1 answer

Convert DocumentTermMatrix to dgTMatrix

I'm trying to run the AssociatedPress dataset from the tm-package through text2vec's LDA implementation. The problem I'm facing is the incompatibility of data types: AssociatedPress is a tm::DocumentTermMatrix which in turn is a subclass of…

r tm text2vec

asked Apr 14 '18 at 20:12

Oliver Baumann
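
One possible way to approach this kind of conversion, sketched under an assumption: the document-term matrix is taken to store its non-zero entries in triplet form with fields `i`, `j`, `v`, `nrow`, `ncol` and `dimnames` (the layout used by slam's `simple_triplet_matrix`). The list `stm` below is a hypothetical stand-in with that layout, not the AssociatedPress data, and only the Matrix package is used.

```r
library(Matrix)

# Hypothetical stand-in for a triplet-form document-term matrix:
# row indices (i), column indices (j), values (v), plus dimensions and names.
stm <- list(
  i = c(1L, 1L, 2L, 3L),
  j = c(1L, 2L, 2L, 3L),
  v = c(2, 1, 4, 1),
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Docs = c("d1", "d2", "d3"),
                  Terms = c("apple", "banana", "cherry"))
)

# Build a sparse matrix from the triplets (column-compressed by default).
m <- sparseMatrix(i = stm$i, j = stm$j, x = stm$v,
                  dims = c(stm$nrow, stm$ncol),
                  dimnames = stm$dimnames)

# Coerce to triplet representation; for a numeric matrix this is a dgTMatrix.
m_t <- as(m, "TsparseMatrix")
class(m_t)
```

With a real `tm::DocumentTermMatrix`, the same fields would be read from the object itself, assuming it follows the triplet layout described above.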

votes

1 answer

Can text2vec and topicmodels generate similar topics with suitable parameter settings for LDA?

I was wondering how results of different packages, hence, algorithms, differ and if parameters could be set in a way to produce similar topics. I had a look at the packages text2vec and topicmodels in particular. I used below code to compare 10…

r lda topicmodels text2vec

asked Oct 17 '17 at 10:43

Manuel Bickel

votes

2 answers

How to resolve R Error using text2vec glove function: unused argument (grain_size = 100000)?

Trying to work through the text2vec vignette in the documentation and here to create word embeddings for some tweets: head(twtdf$Tweet.content) [1] "$NFLX $GS $INTC $YHOO $LVS\n$MSFT $HOG $QCOM $LUV $UAL\n$MLNX $UA $BIIB $GOOGL $GM $V\n$SKX $GE $CAT…

r nlp word-embedding text2vec

asked Apr 11 '17 at 10:49

xq1515426

votes

1 answer

Error in UseMethod("itoken")

I have a dataframe IRC_DF and I would like to create an iterator over input objects to vocabularies. For this I try the following: it_train <- itoken(IRC_DF$Raison.Reco, preprocessor = prep_fun, tokenizer = tok_fun, ids =…

r text2vec

asked Mar 10 '17 at 13:18

Datackatlon

votes

1 answer

How to align two GloVe models in text2vec?

Let's say I have trained two separate GloVe vector space models (using text2vec in R) based on two different corpora. There could be different reasons for doing so: the two base corpora may come from two different time periods, or two very different…

matrix nlp text2vec

asked Nov 19 '16 at 20:17

user3554004

votes

0 answers

R package text2vec--tokenize to sequences

I see lots of functionality in the text2vec package to tokenize strings and make DTMs, but is there a way to create sequences? The Rstudio keras library has this, but it is incredibly slow. The idea being that instead of returning a matrix, you…

r text2vec

asked Mar 11 '19 at 17:48

Jacqueline Nolis

votes

0 answers

Beginner advice about adding start/end sentence markers: using Quanteda functionalities versus doing it manually (custom code)

I need to add begin and end sentence markers to some texts that I analyze using Quanteda. I would like to add these markers using Quanteda but I do not see an explicit way to do that "out of the box". Searching for an answer I found a different…

regex nlp quanteda text2vec

asked Aug 01 '18 at 07:18

user778806
