There is a dictionary data frame, words.dict, of approximately 44 thousand words, and the following code is supposed to replace every word in the character vector dataset.num with its numerical ID from the dictionary.
dataset.num:
dput(head(dataset.num))
c("rt breaking will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers there may be periodic disruptions to the housing application portal sorry for any inconvenience", "hanging out in foiachat anyone have fav management software on the gov t side anything from intake to redaction onwards", "they left out kourtney instead they let chick from big bang talk", "i am encoding film like for the billionth time already ")
words.dict:
dput(head(words.dict, 20))
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")
Loop:
for (i in seq_len(nrow(words.dict))) {
  # Replace each whole-word occurrence of the i-th dictionary word with its ID
  dataset.num <- gsub(paste0("\\b(", words.dict[i, "word"], ")\\b"), words.dict[i, "id"], dataset.num)
}
The data shown above is truncated: the full dataset.num is a character vector of almost 40 thousand lines, each containing about 20 words on average. The code works well on small data, but it is slow on the full data, especially on a machine with limited processing power.
What would you suggest to improve the efficiency and performance of the code?
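For comparison, here is a minimal sketch of one possible direction, not a tested solution for the full data. The loop above scans the whole vector once per dictionary word (about 44 thousand gsub passes); the sketch instead splits each line into words once and looks every word up in a named vector. It assumes the lines contain only words separated by whitespace, as in the sample. The toy words.dict and dataset.num values and the helper name replace_line are hypothetical stand-ins, not the real data. Words missing from the dictionary are left unchanged, and runs of whitespace are collapsed to single spaces, which differs slightly from the gsub version.

```r
# Toy stand-ins for the real objects (hypothetical values, for illustration only)
words.dict <- data.frame(id = c(1L, 2L, 3L),
                         word = c("i", "like", "wine"),
                         stringsAsFactors = FALSE)
dataset.num <- c("i like wine", "wine i like unknownword")

# Named lookup vector: names are the words, values are the IDs as strings.
# as.character() also handles the case where the word column is a factor.
lookup <- setNames(as.character(words.dict$id), as.character(words.dict$word))

# Split every line into words once, on runs of whitespace
tokens <- strsplit(dataset.num, "\\s+")

# Hypothetical helper: map one line's words to IDs, keeping unknown words as-is
replace_line <- function(w) {
  ids <- lookup[w]                 # NA for words not in the dictionary
  missing <- is.na(ids)
  ids[missing] <- w[missing]       # fall back to the original word
  paste(ids, collapse = " ")
}

result <- vapply(tokens, replace_line, character(1), USE.NAMES = FALSE)
print(result)
```

On the toy input the first line becomes "1 2 3", and the second keeps "unknownword" because it has no dictionary entry. Whether this matches the gsub output exactly on the real data depends on the whitespace-only assumption holding.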