I am trying to implement one of the solutions to the question "How to align two GloVe models in text2vec?". I don't understand what the proper input values are for GlobalVectors$new(..., init = list(w_i, w_j)). How do I ensure that the values for w_i and w_j are correct?

Here's a minimal reproducible example. First, prepare some corpora to compare, taken from the quanteda tutorial. I am using dfm_match(all_words) to try and ensure all words are present in each set, but this doesn't seem to have the desired effect.

library(quanteda)

# from https://quanteda.io/articles/pkgdown/replication/text2vec.html

# get a list of all words in all documents
all_words <-
  data_corpus_inaugural %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) %>% 
  types()

# we should expect this many features in each set
length(all_words)

# these are our three sets that we want to compare, we want to project the
# change in a few key words on a fixed background of other words
corpus_1 <- data_corpus_inaugural[1:19]
corpus_2 <- data_corpus_inaugural[20:39]
corpus_3 <- data_corpus_inaugural[40:58]

my_tokens1 <- texts(corpus_1) %>%
  char_tolower() %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) 

my_tokens2 <- texts(corpus_2) %>%
  char_tolower() %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) 

my_tokens3 <- texts(corpus_3) %>%
  char_tolower() %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) 

my_feats1 <- 
  dfm(my_tokens1, verbose = TRUE) %>%
  dfm_trim(min_termfreq = 5) %>% 
  dfm_match(all_words) %>% 
  featnames()

my_feats2 <- 
  dfm(my_tokens2, verbose = TRUE) %>%
  dfm_trim(min_termfreq = 5) %>%
  dfm_match(all_words) %>% 
  featnames()

my_feats3 <- 
  dfm(my_tokens3, verbose = TRUE) %>%
  dfm_trim(min_termfreq = 5) %>%
  dfm_match(all_words) %>% 
  featnames()

# leave the pads so that non-adjacent words will not become adjacent
my_toks1_2 <- tokens_select(my_tokens1, my_feats1, padding = TRUE)
my_toks2_2 <- tokens_select(my_tokens2, my_feats2, padding = TRUE)
my_toks3_2 <- tokens_select(my_tokens3, my_feats3, padding = TRUE)

# Construct the feature co-occurrence matrix
my_fcm1 <- fcm(my_toks1_2, context = "window", tri = TRUE)
my_fcm2 <- fcm(my_toks2_2, context = "window", tri = TRUE)
my_fcm3 <- fcm(my_toks3_2, context = "window", tri = TRUE)

Somewhere in the above steps I believe I need to ensure that the fcm for each set has all the words of all sets to get the matrix dimensions the same, but I'm not sure how to accomplish that.

Now fit the word embedding model for the first set:


library("text2vec")

glove1 <- GlobalVectors$new(rank = 50, 
                            x_max = 10)

my_main1 <- glove1$fit_transform(my_fcm1, 
                               n_iter = 10,
                               convergence_tol = 0.01, 
                               n_threads = 8)

my_context1 <- glove1$components
word_vectors1 <- my_main1 + t(my_context1)

Here is where I get stuck. I want to initialise the second model with the first, so that the coordinate system will be comparable between the two models. I read that w_i and w_j are the main and context word vectors, and b_i and b_j are the biases. I've found values for those in my first model object, but I get an error:

glove2 <- GlobalVectors$new(rank = 50, 
                            x_max = 10,
                            init = list(w_i = glove1$.__enclos_env__$private$w_i, 
                                        b_i = glove1$.__enclos_env__$private$b_i, 
                                        w_j = glove1$.__enclos_env__$private$w_j, 
                                        b_j = glove1$.__enclos_env__$private$b_j))

my_main2 <- glove2$fit_transform(my_fcm2, 
                                 n_iter = 10,
                                 convergence_tol = 0.01, 
                                 n_threads = 8)

The error is:

Error in glove2$fit_transform(my_fcm2, n_iter = 10, convergence_tol = 0.01, : init values provided in the constructor don't match expected dimensions from the input matrix

Assuming I have identified w_i, etc., correctly in the first model, how can I ensure they are the correct size?

Here's my session info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] text2vec_0.6   quanteda_2.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4            pillar_1.4.3          compiler_3.6.0        tools_3.6.0           stopwords_1.0        
 [6] digest_0.6.25         packrat_0.5.0         lifecycle_0.2.0       tibble_3.0.0          gtable_0.3.0         
[11] lattice_0.20-40       pkgconfig_2.0.3       rlang_0.4.5           Matrix_1.2-18         fastmatch_1.1-0      
[16] cli_2.0.2             rstudioapi_0.11       mlapi_0.1.0           parallel_3.6.0        RhpcBLASctl_0.20-17  
[21] dplyr_0.8.5           vctrs_0.2.4           grid_3.6.0            tidyselect_1.0.0.9000 glue_1.3.2           
[26] data.table_1.12.8     R6_2.4.1              fansi_0.4.1           lgr_0.3.4             ggplot2_3.3.0        
[31] purrr_0.3.3           magrittr_1.5          scales_1.1.0          ellipsis_0.3.0        assertthat_0.2.1     
[36] float_0.2-3           rsparse_0.4.0         colorspace_1.4-1      stringi_1.4.6         RcppParallel_5.0.0   
[41] munsell_0.5.0         crayon_1.3.4.9000 

Ben
  • Not a full answer, but have a look at this excellent recent paper: https://www.aclweb.org/anthology/P19-1044 - they compare different methods for aligning corpora (in their case diachronic, which is what motivated the previous question you linked - but it doesn't matter really). They don't test GloVe, but do compare cooccurrence-matrix based approaches (which is what GloVe is) and word2vec. I've done something similar with LSA and it worked fine. – user3554004 Apr 11 '20 at 20:35
  • Thanks very much for that paper, that is helpful. Do I understand correctly that their 'Temporal Referencing' is the same as what you described in your SO question that I linked to? It seems quite straightforward. Is that the method that you also ended up using or did you do something else? – Ben Apr 12 '20 at 06:16
  • It's a similar idea; it basically comes down to making sure the contexts are aligned, which you can do either by explicit referencing, or by aligning co-occurrence matrices by context words/columns and just using the sparse vectors, or by doing that before training a model like LSA or GloVe (I never got it to work with the text2vec GloVe implementation, but I suppose it wasn't implemented with that sort of application in mind either). Of course this wouldn't work for word2vec-type models since there's no matrix, so they provide the referencing solution for that. – user3554004 Apr 12 '20 at 14:19

1 Answer

Here is a working example. See ?rsparse::GloVe documentation for details.

library(rsparse)
data("movielens100k")
x = crossprod(sign(movielens100k))

model = GloVe$new(rank = 10, x_max = 5)

w_i = model$fit_transform(x = x, n_iter = 5, n_threads = 1)
w_j = model$components
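# fit_transform() returns one row per item, so transpose to rank x n_items;
# every element of init must be named (w_i, b_i, w_j, b_j)
init = list(w_i = t(w_i), b_i = model$bias_i, w_j = w_j, b_j = model$bias_j)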

model2 = GloVe$new(rank = 10, x_max = 10, init = init)
w_i2 = model2$fit_transform(x)
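
The dimension error in the question can be checked before fitting. The sketch below is a hypothetical helper, not part of rsparse or text2vec. It assumes, from my reading of the ?rsparse::GloVe documentation, that w_i and w_j should be rank x n matrices and that b_i and b_j should be vectors of length n, where n is the number of rows (for w_i, b_i) or columns (for w_j, b_j) of the co-occurrence matrix. It uses only base R and plain matrices as stand-ins.

```r
# Hypothetical helper (not part of rsparse): report, element by element,
# whether `init` has the shapes assumed above for co-occurrence matrix `x`.
check_init_dims = function(init, x, rank) {
  c(
    w_i = identical(dim(init$w_i), c(as.integer(rank), nrow(x))),
    b_i = length(init$b_i) == nrow(x),
    w_j = identical(dim(init$w_j), c(as.integer(rank), ncol(x))),
    b_j = length(init$b_j) == ncol(x)
  )
}

# Stand-in for a 4-word co-occurrence matrix
x = matrix(0, nrow = 4, ncol = 4)

# Parameters sized for 4 words: every check passes
good = list(w_i = matrix(0, 2, 4), b_i = numeric(4),
            w_j = matrix(0, 2, 4), b_j = numeric(4))

# Parameters taken from a model fitted on a 6-word vocabulary: every check fails
bad = list(w_i = matrix(0, 2, 6), b_i = numeric(6),
           w_j = matrix(0, 2, 6), b_j = numeric(6))

check_init_dims(good, x, rank = 2)
check_init_dims(bad, x, rank = 2)
```

A missing or unnamed element (for example a bias passed without the name b_i) also shows up as FALSE here, because init$b_i is then NULL.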
Dmitriy Selivanov
  • Thanks, that does solve the problem of how to reference the w_i, w_j, etc. elements, but only if the models use corpora of exactly the same size. That's almost never the case in time series analysis, so I guess there is a step we can take to force them to be the same size before we train the model? – Ben Apr 15 '20 at 21:16
  • You will need to find the intersection of the words in the two datasets and init the model with the vectors and biases corresponding to the common words. The co-occurrence matrix for the second model should also consist only of the common words. – Dmitriy Selivanov Apr 16 '20 at 11:22
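
The intersection step described in the last comment could look roughly like the following. This is a minimal sketch in base R. The helper restrict_to_common() and the toy vocabularies are hypothetical, and plain matrices stand in for the real co-occurrence matrices and fitted parameters. It assumes both co-occurrence matrices are square with words as row and column names, and that w_i and w_j are stored as rank x n_words matrices.

```r
# Hypothetical helper: restrict the first model's parameters and the second
# co-occurrence matrix to the vocabulary the two datasets share.
# x1, x2   : square co-occurrence matrices with words as row/column names
# w_i, w_j : rank x nrow(x1) parameter matrices from the first model
# b_i, b_j : bias vectors of length nrow(x1)
restrict_to_common = function(x1, x2, w_i, w_j, b_i, b_j) {
  common = intersect(rownames(x1), rownames(x2))
  idx = match(common, rownames(x1))  # positions of common words in model 1
  list(
    x2 = x2[common, common, drop = FALSE],
    init = list(
      w_i = w_i[, idx, drop = FALSE],
      b_i = b_i[idx],
      w_j = w_j[, idx, drop = FALSE],
      b_j = b_j[idx]
    )
  )
}

# Toy data standing in for two time slices with different vocabularies
words1 = c("liberty", "union", "war", "peace")
words2 = c("peace", "economy", "union")
x1 = matrix(1, 4, 4, dimnames = list(words1, words1))
x2 = matrix(1, 3, 3, dimnames = list(words2, words2))

# Stand-ins for the first model's fitted parameters (rank = 2)
w_i = matrix(seq_len(8), nrow = 2)
w_j = w_i
b_i = seq_len(4) / 10
b_j = b_i

aligned = restrict_to_common(x1, x2, w_i, w_j, b_i, b_j)
rownames(aligned$x2)    # "union" "peace"
dim(aligned$init$w_i)   # 2 2
```

Following the call pattern in the answer, aligned$init would then be passed as the init argument of the second model and aligned$x2 to its fit_transform(). Words that occur in only one dataset are dropped, so they get no vector in the second model.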