
I have a list of 15 million strings and a dictionary of 8 million words. I want to replace every string in the database by the indices of its words in the dictionary. I tried using the hash package for faster indexing, but it still takes hours to process all 15 million strings. What is an efficient way to implement this?

Example:

# Database
[[1]]
[1] "a admit been c case"
[[2]]
[1] "co confirm d ebola ha hospit howard http lik"

# Dictionary
"t"   1
"ker" 2
"be"  3
...

# Output:
[[1]]123 3453 3453 567
[[2]]6786 3423 234123 1234 23423 6767 3423 124431 787889 111

Here the index of "admit" in the dictionary is 3453.

Any kind of help is appreciated.

Updated example with code: This is what I am currently doing.

data =
[1] "a co crimea divid doe east hasten http polit secess split t threaten ukrain via w west xtcnwl youtub"
[2] "billion by cia fund group nazy spent the tweethead ukrain"
[3] "all back energy grandpar home miss my posit radiat the"
[4] "ao bv chega co de ebola http kkmnxv pacy rio suspeito t"
[5] "android androidgam co coin collect gameinsight gold http i jzdydkylwd t ve"

library(hash)                                       # provides hash() for the word -> index map

words.list = strsplit(data, "\\W+", perl=TRUE)      # tokenize each document
words.vector = unlist(words.list)                   # flatten to one vector of words
sorted.words = sort(table(words.vector), decreasing=TRUE)        # word frequencies
h = hash(names(sorted.words), 1:length(names(sorted.words)))     # word -> index

trim.leading = function(x) sub("^\\s+", "", x)      # strip leading whitespace (assumed helper)

index = lapply(data, function(row) {
    temp = trim.leading(row)
    word_list = unlist(strsplit(temp, "\\W+", perl=TRUE))
    index_list = lapply(word_list, function(x) h[[x]])   # look up each word's index
    unlist(index_list)
})
Output:
index
[[1]]
 [1]  6  1 19 21 22 23 31  2 40 44 46  3 48  5 51 52 53 54 55

[[2]]
 [1] 12 14 16 26 30 38 45  4 49  5

[[3]]
 [1]  7 11 25 29 32 36 37 41 42  4

[[4]]
 [1] 10 13 15  1 20 24  2 35 39 43 47  3

[[5]]
 [1]  8  9  1 17 18 27 28  2 33 34  3 50

The output is the list index shown above. This runs fast when data is small, but execution is very slow when the length is 15 million. My eventual task is nearest neighbor search: I want to search for 1000 queries that have the same format as the database. I have also tried parallel computation, but ran into memory issues.

[EDIT] How can I implement this using Rcpp?

Darshan
  • I did try %in%, but it takes a very long time to complete. – Darshan Dec 01 '14 at 08:36
  • I think your best bet is to use the data.table package. Provide a sample of your dictionary and someone will almost certainly show you how to do it (a rough sketch follows these comments). – jlhoward Dec 01 '14 at 08:37
  • I tried representing the dictionary using the hash package and also as named lists, but both took a lot of time. – Darshan Dec 01 '14 at 08:44
  • Take a look at the `tm` (text mining) package. I'm not sure about performance on corpora of that size, but could be worth a shot. Specifically, you might be interested in the *Dictionary* section of [the vignette](http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf). – jbaums Dec 01 '14 at 09:13
  • I tried the tm package, but I had memory issues. – Darshan Dec 01 '14 at 09:15
  • Please provide a **minimal, self contained example**. Check these links for general ideas, and how to do it in R: [**here**](http://stackoverflow.com/help/mcve), [**here**](http://www.sscce.org/), [**here**](http://adv-r.had.co.nz/Reproducibility.html), and [**here**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). – Henrik Dec 01 '14 at 10:16
  • It would also help if you could tell us what you'll do with the data once it's been merged, i.e., another format might be the best way to use your data. – Tommy O'Dell Dec 01 '14 at 10:23
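
Below is a rough, untested sketch of the data.table lookup suggested in the comment above. It assumes the dictionary is available as a character vector dict.words (for example names(sorted.words) from the code above) and that data is the vector of documents; the variable names are illustrative only.

library(data.table)

## Dictionary as a keyed table: word -> position in the dictionary
dict = data.table(word = dict.words, id = seq_along(dict.words), key = "word")

## Tokenize once, outside any loop, and flatten to a single vector
data.list = strsplit(data, "\\W+", perl = TRUE)
words = unlist(data.list)

## Keyed join: one lookup for all tokens at once (NA for words not in the dictionary)
ids = dict[.(words), on = "word"]$id

## Regroup the ids to match the original list structure
index = relist(ids, data.list)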

2 Answers


I think you'd like to avoid the lapply() by splitting the data, unlisting, then processing the vector of words:

data.list = strsplit(data, "\\W+", perl=TRUE)
words = unlist(data.list)
## ... additional processing, e.g., strip white space, on the vector 'words'

Perform the match, then re-list to the original structure:

relist(match(words, word.vector), data.list)

For downstream applications it might actually pay to retain the vector plus the 'partitioning' information, partition = sapply(data.list, length), rather than re-listing, since it will continue to be efficient to operate on the unlisted vector. The Bioconductor S4Vectors package provides a CharacterList class that takes this approach: one mostly works with something that is list-like, but the data are stored, and most operations performed, on an underlying character vector.
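
A minimal end-to-end sketch of this vectorized approach, using the question's data and assuming the dictionary is the character vector names(sorted.words) (variable names here are illustrative):

dict.words = names(sorted.words)                    # dictionary as a plain character vector

data.list = strsplit(data, "\\W+", perl = TRUE)     # tokenize once
words = unlist(data.list)                           # one long vector of all tokens
idx = match(words, dict.words)                      # single vectorized lookup

index = relist(idx, data.list)                      # option 1: restore the list structure

partition = sapply(data.list, length)               # option 2: keep 'idx' flat plus per-document lengths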

Martin Morgan

Sounds like you're doing NLP.

A fast non-R solution (which you could wrap in R) is word2vec:

The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.
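
If you want to drive it from R, a hedged sketch (assuming the compiled word2vec binary is on your PATH; the file names are placeholders and the flags follow the original word2vec demo scripts) might look like:

writeLines(data, "corpus.txt")                      # one document per line

system(paste("word2vec -train corpus.txt -output vectors.txt",
             "-size 100 -min-count 1 -binary 0 -save-vocab vocab.txt"))

## vocab.txt holds the learned vocabulary ("word count" per line) and could
## double as the dictionary for the indexing step
vocab = read.table("vocab.txt", col.names = c("word", "count"),
                   quote = "", comment.char = "", stringsAsFactors = FALSE)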

smci