-4

I am doing some text mining using the tm package. I get ordered lists of words containing over 50,000 words. My corpus contains about 2 million words and I put all of them in a single document.

In order to save some memory and be able to get ngrams (2- and 3grams) with more terms I want to replace the words in the corpus with numbers. There are 2 ways I can do this.

1) for each word in my ordered list of words I can look up all locations in the corpus and replace that word with the number I want. This means I've to go through my document 50,000 times and each time check all 2 million word. That would be 100 billion compares.

2) for each of the 2 million words in the corpus do a lookup in my list of 50,000 words. With a binary search I should find the word in the list in at most 16 tries. That would mean I need to do only 32 million compares.

I've been looking around a bit on SO and using google. I find some suggestions for code and in C and C++. Now I can implement a binary text search myself without a problem, but I would prefer to use an existing package or function preferably one implementing parallel processing as well.

Any suggestions?

Etienne Moerman
  • 331
  • 1
  • 9
  • I don't get the down vote. What's the problem with this question? – Etienne Moerman Nov 28 '16 at 20:48
  • https://stackoverflow.com/help/on-topic – user3386109 Nov 28 '16 at 21:17
  • Thanks, I changed the title. How's this? – Etienne Moerman Nov 28 '16 at 21:28
  • 1
    check out the `?match()` function – Barker Nov 29 '16 at 01:32
  • Downvotes and Close Votes indicate that the post needs more improvement. Essentially you are asking if there is a package, if not write me a code. Which is "Tool request" and "do my job for me". Also, [reproducible data](http://stackoverflow.com/questions/5963269) would be helpful. – zx8754 Nov 29 '16 at 07:24
  • @zx8754 Sorry, I'm explicitly stating I can write the code myself. I'm asking if there is standard functionality. This is such a common task that I'm assuming some standard functionality exists. I just cannot find it and am asking for a name (package or function) to look further. – Etienne Moerman Nov 29 '16 at 08:36
  • @Barker the match() function does what I want functionally. However it says nothing about ordered list, so I'm assuming that it'll go over the list until it reaches a match. I'll give it a try though and compare it to the binary search I'm going to build myself. – Etienne Moerman Nov 29 '16 at 08:47
  • `match()` does not assume the list is sorted, but if the input list is sorted, it should additively increase the run time by `O(n)`, which since the actual matching algorithm is more than that, it should still have the same `O(n)` run time. – Barker Nov 29 '16 at 17:19
  • @Barker thanks. I built a rudimentary binary search and a I compared it to the match() function. For 900 lines of text the match function was a bit faster, for 2700 lines it was a lot faster, for 9000 lines it was a bit slower and for 18000 lines it was twice as slow. A binary search has a runtime O(log n), so for now I'm going for the binary search. I might use the match function in the last stretch of the binary search (if the need is there and if I've time). Thanks for your help. – Etienne Moerman Nov 29 '16 at 20:49
  • @EtienneMoerman Did I answer your question? If so please accept if not please put a comment. – A. Mashreghi Oct 13 '17 at 18:53
  • @A.Mashreghi Thanks for your help, but it wasn't the answer I was looking for. I don't need help implementing binary search algorithms. Since binary search is basic functionality I assumed it is already built somewhere. And I was right. I used a data.table and that has a binary search available. – Etienne Moerman Dec 29 '17 at 12:46

1 Answers1

3

You can do it using set data-structure; it works for your application. Or you can use standard binary search:

Set Data Structure in C++

Binary Search in C++

Also with the following code you can create as many threads as you wants; however, to my experience, 32 million operations is less than 1 second for C++ on most systems. C++ is really fast.

You divide the 2 million interval between as many threads as you want.

#include <string>
#include <iostream>
#include <thread>
#include <set>

using namespace std;

set <string> s; //Set also works in O(log n); no need for binary search
string corpus[2*1000*1000];
string dict[50000];

// The function we want to execute on the new thread.
void func(int start, int end)
{
    for(int i = start; i < end; i++){ // This is words in Corpus
        //============ YOU CAN ALSO WRITE YOUR BINARY SEARCH HERE ================

        // If you want you
        if(s.find(corpus[i]) != s.end()){
            //the word is in the 50000 dictionary
        }
        else {
            //It is not
        }
    }
}

int main()
{

    for(int i = 0; i < 50000; i++){
        s.insert(dict[i]);
    }

    // Constructs the new thread and runs it. Does not block execution.

    //=========== ADD AS MANY THREADS AS YOU WANT ======================
    thread t1(func, 1, 1000000);
    thread t2(func, 1000000, 2000000);

    // Makes the main thread wait for the new thread to finish execution, therefore blocks its own execution.
    t1.join();
    t2.join();
}
A. Mashreghi
  • 1,729
  • 2
  • 20
  • 33
  • Thanks. I was looking for something in R. I've never plugged in any other code into R. I really have to try that soon. For now, I'll write something in R and check if it's performance first. – Etienne Moerman Nov 29 '16 at 08:44
  • BTW the multi-threading code is much appreciated since I've never actually done that myself. – Etienne Moerman Nov 29 '16 at 08:51