
I am trying to train my own GloVe word embeddings in R (rather than use pre-trained ones). I have a small corpus of about 40,000 tokens: 30 political statements, with three docvars (speaker, party, number of years in government). The texts have been cleaned by removing stop words and punctuation. How do I train a GloVe model so that I can find the cosine distance between words? Is this even possible with only 30 documents? Most of the code I have seen online just uses pre-trained embeddings and no other data.

I have done this:

summary(df)

text:      Length: 30     Class: character   Mode: character
yearInGov: Min. 1.000     1st Qu. 2.000      Median 3.000     Mean 3.867   3rd Qu. 5.750   Max. 11.000
party:     Length: 30     Class: character   Mode: character
speaker:   Length: 30     Class: character   Mode: character

library(text2vec)

it <- itoken(df$budgetTextClean,
             tokenizer = word_tokenizer,
             ids = df$speaker,
             progressbar = TRUE)

vocab <- create_vocabulary(it) # use uni-grams

# prune the vocabulary of low-frequency words

vocab <- prune_vocabulary(vocab, term_count_min = 3)

# What's in the vocabulary?

print(vocab)

vectorizer <- vocab_vectorizer(vocab)

# create the term co-occurrence matrix (TCM); by default create_tcm() uses a symmetric skip-gram window of 5

tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# maximum number of co-occurrences to use in the weighting function;
# here we take the vocabulary size divided by 100
x_max <- length(vocab$doc_count) / 100

# set up the embedding matrix and fit the model

glove_model <- GloVe$new(rank = 300, x_max = x_max)
glove_embedding <- glove_model$fit_transform(tcm, n_iter = 20, convergence_tol = 0.01, n_threads = 4)

# combine the main and context embeddings (sum) into one matrix

glove_embedding <- glove_embedding + t(glove_model$components) # add the transpose of the context matrix
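
The final step I am aiming for would look something like this (a rough sketch using sim2() from text2vec; "budget" is just an example word that I am assuming survives the pruning):

# cosine similarity between one example word and every word in the vocabulary
word_vec <- glove_embedding["budget", , drop = FALSE]
cos_sim <- sim2(x = glove_embedding, y = word_vec, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10) # most similar terms

Cosine distance would then just be 1 - cos_sim.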
    Ruth, you need to share a little of your data to help us find a solution. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Ben Aug 02 '23 at 15:32
