
I am building a model to classify text comments into two categories using GloVe word embeddings. I have two columns: one with the textual data (comments) and the other a binary target variable (whether a comment is actionable or not). I was able to generate GloVe word embeddings for the textual data using the following code from the text2vec documentation.

glove_model <- GlobalVectors$new(word_vectors_size = 50,
                                 vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20, convergence_tol = -1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main + t(word_vectors_context)

How do I build a model and generate predictions on test data?

  • @blacksite you can go through the following link for Python implementation [Using pre-trained word embeddings in a Keras model](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) – sri sivani charan May 18 '18 at 20:13

2 Answers


text2vec has a standard predict method (as most R libraries do) that you can use in a straightforward fashion: have a look at the documentation.

To make a long story short, just use

predictions <- predict(fitted_model, data)
gented
  • Thank you for your reply. I have used predict(fitted_model, data) for tf-idf and n-gram approaches. But my question is how do I build a model after generating word embeddings using GloVe. In the text2vec documentation, the word embeddings are only used to check analogy accuracy against a pretrained file. – sri sivani charan Mar 06 '18 at 00:32
  • What do you mean by "building a model"? Do you want to fit the model, predict on new data, get the vectors for each word? What is it that you want to do in particular (and that you have not found in the documentation)? – gented Mar 06 '18 at 08:29
  • This is my understanding from the [documentation](https://cran.r-project.org/web/packages/text2vec/text2vec.pdf). I am a bit confused, so correct me if I am wrong: first a model is built on the training data using glove$fit_transform, which fits a GloVe model to the input matrix. This just generates word embeddings for the textual data. So how do I build a model using these word embeddings and my target variable? And after building that model, how do I fit it to test data to generate predictions? – sri sivani charan Mar 06 '18 at 17:33
  • The word embeddings *are* the model: *"how do i build a model using these word embeddings and my target variable"* what is it that you want to "build" now? What "model"? – gented Mar 06 '18 at 17:56
  • These embeddings are only computed on one column, which consists of textual data. My data has two columns: one holds the textual data and the other is the target variable, which says whether a comment is actionable or not. I have created a co-occurrence matrix on the textual data and then generated these word embeddings. I want to build a classification model on this textual data to find whether a comment is actionable or not. – sri sivani charan Mar 06 '18 at 19:21
  • I do not understand what all these things mean. What is a "target variable" in a context of words embeddings and text model? It seems that you are trying to solve a different problem where word embeddings are just one piece of information (but in this case the question that you are asking is outside the scope) – gented Mar 06 '18 at 19:43
  • Exactly, that is my point: word embeddings are just a piece of information. Put simply, I want to build a model to predict whether the comments (textual) are actionable or not (target variable). How do I do this using the word embeddings from the textual data? – sri sivani charan Mar 06 '18 at 20:52
  • Well, but then this is a whole different question than what you stated in the original post. – gented Mar 06 '18 at 20:53
  • Sorry, my bad, I hadn't conveyed my question properly in the heading. I have edited the heading and question. – sri sivani charan Mar 07 '18 at 00:08

Got it.

glove_model <- GlobalVectors$new(word_vectors_size = 50,
                                 vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20, convergence_tol = -1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main + t(word_vectors_context)

After creating the word embeddings, build an index that maps words (strings) to their vector representations (numeric vectors).

# 'lines' is assumed to hold the lines of a GloVe-format text file
# (a word followed by its coefficients on each line), e.g. lines <- readLines(glove_file)
embeddings_index <- new.env(parent = emptyenv())
for (line in lines) {
  values <- strsplit(line, ' ', fixed = TRUE)[[1]]
  word <- values[[1]]
  coefs <- as.numeric(values[-1])
  embeddings_index[[word]] <- coefs
}
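
The loop above assumes the vectors sit in a GloVe-format text file. Since the vectors here were trained with text2vec and already live in the word_vectors matrix, here is a minimal sketch of building the same index directly from that matrix, assuming its rownames hold the vocabulary terms:

# Build the word -> vector lookup straight from the text2vec output
# (assumes rownames(word_vectors) contains the vocabulary)
embeddings_index <- new.env(parent = emptyenv())
for (word in rownames(word_vectors)) {
  embeddings_index[[word]] <- word_vectors[word, ]
}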

Next, build an embedding matrix of shape (max_words, embedding_dim) that can be loaded into an embedding layer.

embedding_dim <- 50  # number of dimensions used to represent each word
embedding_matrix <- array(0, c(max_words, embedding_dim))
for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector)) {
      # words not found in the embedding index will all be zeros
      embedding_matrix[index + 1, ] <- embedding_vector
    }
  }
}
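
For completeness, word_index, max_words and the padded training data used here are assumed to come from the usual keras text-preprocessing step. A minimal sketch of how they might be produced (train_data, comments and target are placeholder names for your own data frame and columns):

library(keras)

max_words <- 10000   # vocabulary size kept by the tokenizer
maxlen    <- 100     # pad/truncate every comment to this many tokens

# Fit a tokenizer on the raw comments and turn each comment into a padded
# sequence of word indices
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(train_data$comments)

word_index <- tokenizer$word_index
sequences  <- texts_to_sequences(tokenizer, train_data$comments)
x_train    <- pad_sequences(sequences, maxlen = maxlen)
y_train    <- train_data$target   # binary actionable / not actionable flag
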
We can then load this embedding matrix into the embedding layer, build a model and then generate predictions.

model_pretrained <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = embedding_dim,
                  input_length = maxlen) %>%   # maxlen: length of the padded input sequences
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
summary(model_pretrained)

# Load the GloVe embeddings into the first (embedding) layer and freeze it
get_layer(model_pretrained, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()

model_pretrained %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- model_pretrained %>% fit(
  x_train, y_train,
  validation_data = list(x_val, y_val),
  epochs = num_epochs,
  batch_size = 32
)

Then use the standard predict function to generate predictions on the test data.
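
For instance, a minimal sketch, assuming x_test holds the test comments tokenised and padded with the same tokenizer and maxlen as the training data:

# Predicted probability that each test comment is actionable
pred_probs <- model_pretrained %>% predict(x_test)

# Convert probabilities to 0/1 class labels with a 0.5 cut-off
pred_classes <- ifelse(pred_probs > 0.5, 1, 0)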

Check the following links:

Use word embeddings to build a model in Keras

Pre-trained word embeddings

sri sivani charan