I have a data frame data1 with cleaned text strings matched to their ids:
# A tibble: 2,000 x 2
id text
<int> <chr>
1 decent scene guys visit spanish lady hilarious flamenco music background re…
3 movie beautiful plot depth kolossal scenes battles moral rationale br br conclusion wond…
4 fan scream killing astonishment story summarized don time move ii won regret plot ironical
5 mistake film guess minutes clunker fought hard stay seat lose hours life feeling br his…
6 phoned awful bed dog ranstuck br br positive grooming eldest daughter beeeatch br ous…
# … with 1,990 more rows
I have also created a new data frame freq that, for every word, gives the tf, idf and tf_idf. In order, the columns of freq are id, word, n, tf, idf and tf_idf:
# A tibble: 112,709 x 6
id word n tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 335 starcrash 1 0.5 7.60 3.80
2 2974 carly 1 0.5 6.50 3.25
3 1796 phillips 1 0.5 5.81 2.90
4 1796 eric 1 0.5 5.40 2.70
5 1398 wilson 1 0.5 5.20 2.60
6 684 apolitical 1 0.333 7.60 2.53
7 1485 saimin 1 0.333 7.60 2.53
8 1398 charlie 1 0.5 4.77 2.38
9 2733 shouldn 1 0.5 4.71 2.36
10 2974 jones 1 0.5 4.47 2.23
# … with 112,699 more rows
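For context, a table with this shape can be produced from data1 with tidytext's bind_tf_idf; the sketch below is one way to build it (the exact tokenisation I used may differ):

```r
library(dplyr)
library(tidytext)

# unnest_tokens() splits text into one word per row, count() tallies
# words per document, and bind_tf_idf() adds tf, idf and tf_idf
freq <- data1 %>%
  unnest_tokens(word, text) %>%
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n)
```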
I am trying to create a loop that goes through freq and, for every word in data1 whose tf is lower than the mean tf of all words, substitutes it with the closest word2vec match.
I have tried the function
replace_word <- function(x) {
  # hunspell_suggest() returns a list with one character vector of
  # suggestions per input word; keep the first suggestion
  x <- hunspell_suggest(x)[[1]][1]
  # load the example model shipped with the word2vec package
  p <- system.file(package = "word2vec", "models", "example.bin")
  m <- read.word2vec(p)
  # nearest neighbour of x in the embedding space
  s <- predict(m, x, type = "nearest", top_n = 1)
  s[[1]]$term2
}
But when I run it, it goes into an infinite loop. I originally wanted to check whether the spelling of each word was correct first, but because some words are not in the dictionary I kept getting errors. I have never done something like this before, so I really don't know how to make it work. Could someone please help?
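To make the goal concrete, here is a sketch of the substitution I am after. It assumes the example model that ships with the word2vec package (the real model, and how ties and out-of-vocabulary words are handled, may differ), and it loads the model once rather than inside the function:

```r
library(dplyr)
library(word2vec)
library(stringr)

# Load the example model shipped with the word2vec package once
p <- system.file(package = "word2vec", "models", "example.bin")
m <- read.word2vec(p)

# Words whose tf falls below the mean tf of all words in freq
low_tf <- freq %>%
  filter(tf < mean(freq$tf)) %>%
  pull(word) %>%
  unique()

# Nearest embedding-space neighbour for each low-tf word; words absent
# from the model's vocabulary are left unchanged
nearest <- sapply(low_tf, function(w) {
  s <- tryCatch(predict(m, w, type = "nearest", top_n = 1),
                error = function(e) NULL)
  if (is.null(s)) w else s[[1]]$term2
})

# Substitute the words in data1, matching on whole-word boundaries
data1$text <- str_replace_all(
  data1$text,
  setNames(nearest, str_c("\\b", low_tf, "\\b"))
)
```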
Thank you