I am trying to train my own GloVe word embeddings in R using the text2vec package. I have a small corpus of 30 political statements, roughly 40,000 tokens in total, with three docvars: speaker, party, and number of years in government. The data has already been cleaned by removing stop words and punctuation. How do I train a GloVe model so that I can find the cosine distance between words? Is this even possible with a corpus this small? Most of the code I have seen online just uses pre-trained embeddings rather than training on new data.
I have done this:
summary(df)
     text              yearInGov         party             speaker         
 Length:30          Min.   : 1.000   Length:30          Length:30        
 Class :character   1st Qu.: 2.000   Class :character   Class :character 
 Mode  :character   Median : 3.000   Mode  :character   Mode  :character 
                    Mean   : 3.867                                       
                    3rd Qu.: 5.750                                       
                    Max.   :11.000                                       
library(text2vec)

it <- itoken(df$text,  # summary(df) shows the cleaned text column is named `text`
             tokenizer = word_tokenizer,
             ids = df$speaker,
             progressbar = TRUE)
vocab <- create_vocabulary(it) # use uni-grams
# prune low-frequency words from the vocabulary
vocab <- prune_vocabulary(vocab, term_count_min = 3)

# inspect the vocabulary
print(vocab)
vectorizer <- vocab_vectorizer(vocab)
# create the term-co-occurrence matrix (TCM) with a symmetric skip-gram window of 5
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
# x_max caps the co-occurrence counts in GloVe's weighting function;
# here we use the number of terms in the pruned vocabulary divided by 100
x_max <- length(vocab$doc_count) / 100

# set up the embedding matrix and fit the model, using the x_max computed above
glove_model <- GloVe$new(rank = 300, x_max = x_max)
glove_embedding <- glove_model$fit_transform(tcm, n_iter = 20, convergence_tol = 0.01, n_threads = 4)
# combine the main and context embeddings (sum) into one matrix
glove_embedding <- glove_embedding + t(glove_model$components)  # transpose of the context matrix
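
For the cosine part, my understanding is that something like the sketch below should work once the model is fitted. It uses sim2() from text2vec; "budget" is just a placeholder query word (any term that survived pruning would do), and cosine distance is 1 minus the cosine similarity:

# cosine similarity between one word and every other word in the vocabulary;
# "budget" is a placeholder term, substitute any word from the pruned vocabulary
query <- glove_embedding["budget", , drop = FALSE]
cos_sim <- sim2(x = glove_embedding, y = query, method = "cosine", norm = "l2")

# ten nearest neighbours by cosine similarity
head(sort(cos_sim[, 1], decreasing = TRUE), 10)

# cosine distance, if that is what's needed, is 1 - cosine similarity
cos_dist <- 1 - cos_sim

Is this the right approach, or does a corpus this small simply not give usable vectors?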