Some time ago I posted a question about saving/retrieving a Suffix Tree from disk. That's finally working fine, but now the construction is extremely slow, and I don't want to mess with the Ukkonen's algorithm (linear construction) right now.
So, I wanted to make concurrent insertions to accelerate the process without having to make the tree thread-safe.
The Suffix Tree stores words by its initial character (look the image posted on my previous question), thus, the word Banana is in the 'B' child of the root node, and Apple would be in the 'A' child and so forth. So, the insertion of a word starting with 'B' will never interfere with a insertion starting with 'A'. My idea is to have a thread for each initial character of the set of words being inserted: a thread inserting 'A's, another thread inserting 'B's, etc.
So I was thinking about a Executer
class, that just adds words to queue of words of each
Process
(first create it if it doesn't exists).
class Executer:
#...
def concurrent_insertion(word):
k = word[0]
processes.get(k, Process()).add(word)
# ...
And the class Process
is the one that makes the insertions. Each Process
instance is a independent thread, with a Queue
containing the words that still has to insert.
In this Process
class is where I'm having troubles, I'm guessing it should inherit from threading.Thread
, because each instance should be a thread, but how do I keep it alive until all the text processing is done? I mean, it should be inserting words from its Queue
of words, but when the Queue
is empty the thread should not die, just keep waiting till more words fill the Queue
, "wake up" and continue inserting.