I am doing some text mining using the tm package. From it I get an ordered list of over 50,000 unique words. My corpus contains about 2 million words, and I have put all of them into a single document.
In order to save some memory, and to be able to get n-grams (2- and 3-grams) with more terms, I want to replace the words in the corpus with numbers. There are two ways I can do this.
1) For each word in my ordered list of words, I can look up all its locations in the corpus and replace that word with the number I want. This means I have to go through my document 50,000 times and each time check all 2 million words. That would be 100 billion comparisons.
2) For each of the 2 million words in the corpus, do a lookup in my list of 50,000 words. With a binary search I should find a word in the list in at most 16 tries, so I would need to do only about 32 million comparisons. A rough sketch of what I mean is below.
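To make option 2 concrete, here is a rough, untested sketch of the lookup I have in mind; `vocab` and `corpus_words` are just placeholder names for my sorted word list and the tokenised corpus:

```r
## Sketch of option 2: binary search of each corpus word in the sorted vocabulary.
## Assumes `vocab` is sorted with the same collation used by the comparisons below.
binary_lookup <- function(word, vocab) {
  lo <- 1L
  hi <- length(vocab)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (vocab[mid] == word) return(mid)   # the position in `vocab` doubles as the word's number
    if (vocab[mid] < word) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_                             # word not found in the vocabulary
}

## Replace every token by its numeric id, one lookup per corpus word.
ids <- vapply(corpus_words, binary_lookup, integer(1), vocab = vocab)
```

That per-word loop is exactly the part I would rather hand off to an existing (ideally parallel) function.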
I've been looking around a bit on SO and on Google, and I found some code suggestions in C and C++. I can implement a binary text search myself without a problem, but I would prefer to use an existing package or function, preferably one that supports parallel processing as well.
Any suggestions?