I have a list of 15 million strings and a dictionary of 8 million words. I want to replace every word in every string of the database with the index of that word in the dictionary. I tried using the hash package for faster lookups, but replacing the words in all 15 million strings still takes hours. What is an efficient way to implement this?
Example [EDITED]:
# Database
[[1]]
[1] "a admit been c case"
[[2]]
[1] "co confirm d ebola ha hospit howard http lik"
# dictionary
"t" 1
"ker" 2
"be" 3
.
.
.
.
# Output:
[[1]] 123 3453 3453 567
[[2]] 6786 3423 234123 1234 23423 6767 3423 124431 787889 111
where the index of "admit" in the dictionary is 3453.
Any kind of help is appreciated.
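To make the desired transformation concrete, here is a minimal sketch in base R. The dictionary below is a made-up stand-in (its words and positions are hypothetical, not taken from the real 8-million-word dictionary), `to_indices` is an illustrative helper name, and `match()` is only one possible way to do the lookup. Words that are missing from the dictionary come back as `NA`.

```r
# Hypothetical toy dictionary: a word's index is its position in this vector.
dictionary <- c("t", "ker", "be", "a", "admit", "been", "c", "case")

# Two database entries in the same space-separated format as above.
database <- c("a admit been c case", "be ker unknownword")

# Illustrative helper: split one entry into words and look each word up.
to_indices <- function(entry, dict) {
  words <- strsplit(entry, " ", fixed = TRUE)[[1]]
  match(words, dict)  # NA for words that are not in the dictionary
}

result <- lapply(database, to_indices, dict = dictionary)
print(result)
```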
Updated Example with Code:
This is what I am currently doing.
Example: data =
[1] "a co crimea divid doe east hasten http polit secess split t threaten ukrain via w west xtcnwl youtub"
[2] "billion by cia fund group nazy spent the tweethead ukrain"
[3] "all back energy grandpar home miss my posit radiat the"
[4] "ao bv chega co de ebola http kkmnxv pacy rio suspeito t"
[5] "android androidgam co coin collect gameinsight gold http i jzdydkylwd t ve"
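library(hash)

# trim.leading is not defined in this post; it is assumed to strip leading whitespace.
trim.leading = function(x) sub("^\\s+", "", x)

# Build the dictionary: every distinct word, ordered by decreasing frequency.
words.list = strsplit(data, "\\W+", perl=TRUE)
words.vector = unlist(words.list)
sorted.words = sort(table(words.vector), decreasing=TRUE)
h = hash(names(sorted.words), 1:length(names(sorted.words)))

# Replace each word of each row by its index in the dictionary.
index = lapply(data, function(row)
{
    temp = trim.leading(row)
    word_list = unlist(strsplit(temp, "\\W+", perl=TRUE))
    # One hash lookup per word; this is the slow part on large data.
    index_list = lapply(word_list, function(x)
    {
        return(h[[x]])
    })
    #print(index_list)
    return(unlist(index_list))
})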
Output:
index
[[1]]
[1] 6 1 19 21 22 23 31 2 40 44 46 3 48 5 51 52 53 54 55
[[2]]
[1] 12 14 16 26 30 38 45 4 49 5
[[3]]
[1] 7 11 25 29 32 36 37 41 42 4
[[4]]
[1] 10 13 15 1 20 24 2 35 39 43 47 3
[[5]]
[1] 8 9 1 17 18 27 28 2 33 34 3 50
The output is the list index. This runs fast when data is small, but it is very slow when data has 15 million rows. My actual task is a nearest neighbor search: I want to run 1000 queries that are in the same format as the database. I have also tried other approaches, such as parallel computation, but ran into memory issues.
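One commonly suggested way to avoid a separate lookup call per word is to flatten all words into one vector, do a single vectorized `match()` against the dictionary, and then regroup the result by row. The sketch below is only an illustration under assumptions: it uses base R, a plain character vector `dict` stands in for the dictionary, `index_rows` is a hypothetical helper name, and the toy data is made up. It has not been timed on 15 million rows.

```r
# Hypothetical helper: map every word of every row to its position in `dict`.
index_rows <- function(rows, dict) {
  rows <- trimws(rows)                       # drop leading/trailing whitespace
  words_by_row <- strsplit(rows, "\\W+", perl = TRUE)
  n_words <- lengths(words_by_row)           # number of words in each row
  flat <- match(unlist(words_by_row), dict)  # one vectorized lookup for all words
  # Regroup the flat index vector by row; explicit levels keep empty rows too.
  row_id <- factor(rep(seq_along(rows), n_words), levels = seq_along(rows))
  unname(split(flat, row_id))
}

# Made-up toy input in the same format as the data above.
data <- c("billion by cia fund group",
          "all back energy fund",
          "cia energy")
dict <- c("fund", "cia", "energy", "all", "back", "billion", "by", "group")

index <- index_rows(data, dict)
print(index)
```

If memory is the limiting factor, the same function could be called on successive slices of the data (for example a million rows at a time) so that only one slice of split words is held in memory at once. Whether that is sufficient for the full 15 million rows is an open question here.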
[EDIT] How can I implement this using Rcpp?