
There is a dictionary data frame, words.dict, of approximately 44 thousand words, and the following code is supposed to replace every word in dataset.num with its numeric ID from the dictionary.

dataset.num:

dput(head(dataset.num))
c("rt   breaking  will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers  there may be periodic disruptions to the housing application portal  sorry for any inconvenience", "hanging out in  foiachat  anyone have fav  management software on the gov t side  anything from intake to redaction   onwards", "they left out kourtney  instead they let chick from big bang talk", "i  am  encoding  film   like  for the  billionth time already ")

words.dict:

dput(head(words.dict, 20))
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")

Loop:

for (i in 1:nrow(words.dict))
    dataset.num <- gsub(paste0("\\b(", words.dict[i, "word"], ")\\b"), words.dict[i, 1], dataset.num)

I truncated the data above; the full dataset.num is a character vector of almost 40 thousand lines, each containing about 20 words on average. The code works well on small data but is slow on large data, given limited processing power.

What would you suggest to improve the efficiency & performance of the code?

Nal
  • Can you provide a minimal example of the dataset using `dput(droplevels(head(dataset.num)))`? – talat Apr 21 '16 at 08:26
  • Have you tried to make use of the `apply` function? It's essentially a vectorized implementation of a `for` loop and will be much faster – Hanjo Odendaal Apr 21 '16 at 08:30
  • @HanjoJo'burgOdendaal `apply` is *not* a "vectorized implementation of a for loop" and is not "much faster". Actually, it's a wrapper around an R `for` loop. See the source code of `apply`. Where did you get that false information? – nicola Apr 21 '16 at 08:34
  • A minimal reproducible example would help, but the package `dplyr` implements some of its functions in C++ and could maybe help you get faster results here...? – ztl Apr 21 '16 at 08:37
  • @nicola, thinking that apply is a 'vectorized implementation' seems to be a common misconception. Here is a nice link to a discussion around this - http://stackoverflow.com/questions/28983292/is-the-apply-family-really-not-vectorized. You learn new things every day – Hanjo Odendaal Apr 21 '16 at 08:45
  • Hard to judge without a reproducible example, but is using 'merge' not an option here, probably preceded by some data-cleaning? – Heroka Apr 21 '16 at 08:49
  • @docendodiscimus added. – Nal Apr 21 '16 at 09:00

1 Answer


Here's a different approach, which perhaps scales better, though I haven't really tested it.

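# split each line on whitespace, then translate its words in one match() call
sapply(strsplit(dataset.num, "\\s+"), function(y) {
  i <- match(y, words.dict$word)              # dictionary row of each word, NA if absent
  y[!is.na(i)] <- words.dict$id[na.omit(i)]   # replace only the words that were found
  paste(y, collapse = " ")                    # glue the line back together
})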
#[1] "rt 22 will from here forward 3 know 18"                                                                           
#[2] "i hope you like wine 12 24"                                                                                       
#[3] "this week we 17 upgrading our servers there may 3 periodic 25 to the housing 16 portal sorry for 13 inconvenience"
#[4] "hanging out in foiachat 14 have 27 management software on the gov t side 15 from intake to redaction onwards"     
#[5] "they left out kourtney instead they let 23 from 20 19 talk"                                                       
#[6] "i 11 26 28 like for the 21 time 10"

Note that you could use stringi::stri_split to speed up the string splitting.
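
As a further untested-at-scale variation on the same idea, the per-line function call can be avoided by flattening all words into one vector and calling match() only once. This is a minimal base R sketch, not a benchmarked solution; the two-line dataset.num and the seven-row words.dict below are toy stand-ins built from the question's sample, and the names tokens, flat, idx, hit, line.id and result are made up for illustration.

```r
# Toy stand-ins for the question's objects (assumed shapes: a character
# vector of lines and a data frame with `id` and `word` columns).
dataset.num <- c("i hope you like wine and cocktails",
                 "i  am  encoding  film   like  for the  billionth time already ")
words.dict <- data.frame(
  id   = c(10L, 11L, 12L, 24L, 26L, 28L, 21L),
  word = c("already", "am", "and", "cocktails", "encoding", "film", "billionth"),
  stringsAsFactors = TRUE
)

# Split every line once and flatten all words into one long vector.
tokens <- strsplit(dataset.num, "\\s+")
flat   <- unlist(tokens)

# A single match() call over all words; the factor is converted to character
# explicitly so the lookup compares strings.
idx <- match(flat, as.character(words.dict$word))
hit <- !is.na(idx)
flat[hit] <- as.character(words.dict$id[idx[hit]])

# Rebuild the lines: each word is tagged with the index of its source line.
# Using a factor with all levels keeps lines that contain no words at all.
line.id <- factor(rep(seq_along(tokens), lengths(tokens)),
                  levels = seq_along(tokens))
result  <- unname(vapply(split(flat, line.id), paste, character(1),
                         collapse = " "))
result
```

Words that are not in the dictionary are left unchanged, as in the sapply version. Whether this is actually faster on 40 thousand lines would need to be measured, for example with system.time().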

talat