This is a follow-up to my previous question. As suggested by Tim Peters, using a Manager may not necessarily be the best approach. Unfortunately I've got too much scaffolding code to post an SSCCE, so instead I'll try to provide a detailed explanation of my problem. Please feel free to browse the entire codebase on GitHub, but it's a bit of a mess right now.
Background
I am doing research in Natural Language Processing and I'd like to do (something like) dictionary-based smoothing for document classification. The idea is to train a classifier to associate words and phrases with a correct answer. For example, documents containing the word socialist are likely to be about politics, and those containing the phrase lava temperature are likely to be about geology. The system is trained by looking at a small number of pre-labelled examples. Because language is so varied, a classifier will never "know about" all possible phrases that it might encounter in production.

This is where the dictionary comes in. Suppose we had a cheap and easy way of getting synonyms for almost any phrase out there (I'll cite myself because it's poor taste). When the poor classifier is faced with a phrase it doesn't know about, we could look it up in said dictionary and tell the classifier "Look, you do not know about communism, but it's kinda like socialist, and you know about that!". If the dictionary is reasonable, the classifier will generally perform better.
Pseudo code
    data = Load training and testing documents (300MB on disk)
    dictionary = Load dictionary (200MB - 2GB on disk) and place into a `dict` for fast look-ups

    Repeat 25 times:
        do_work(data, dictionary)

    def do_work(data, dictionary):
        X = Select a random sample of data
        Train a classifier on X
        Y = Select a random sample of data
        Using dictionary, classify all documents in Y
        Write results to disk
The problem
The loop above is a perfect candidate for parallelisation. I have been using a Python 2.7 multiprocessing.Pool (through joblib.Parallel, because it's easy and provides a very useful traceback if things go south). All worker processes need read-only access to the dictionary and the document collection. There is no need for the workers to communicate with one another or with the parent process: all they do is spawn, do some magic, write a file and die.
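For concreteness, the parallel loop looks roughly like this (simplified sketch; run_experiment stands in for do_work above, and load_documents/load_dictionary are placeholders for my actual loading code):

    from joblib import Parallel, delayed

    def run_experiment(data, dictionary, seed):
        # Sample, train, classify and write results to disk (see do_work above).
        pass

    data = load_documents()         # ~300MB on disk, several GB once parsed
    dictionary = load_dictionary()  # 200MB - 2GB on disk, loaded into a dict

    # Each delayed call is pickled and shipped to a worker process, which is
    # (I believe) where the per-worker copies of data and dictionary come from.
    Parallel(n_jobs=8)(
        delayed(run_experiment)(data, dictionary, seed) for seed in range(25))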
The dictionary needs to support fast random access. I do not know which documents the sample Y will contain, so I cannot easily prune the dictionary and pass just the part of it that is needed to each worker. The dictionary will be queried very often: typical hit counts per run are in the millions.
Currently my code is memory-bound, as (I believe) copies of the document collection and dictionary are being made for each worker process. When parsed, data and dictionary typically use up several GB of RAM. I've tried using multiprocessing.managers.BaseManager to avoid copying the large objects, but that slowed the workers down.
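For illustration, the kind of BaseManager setup I mean looks roughly like this (simplified sketch; load_dictionary is a placeholder for my actual loading code):

    from multiprocessing.managers import BaseManager

    class DictionaryManager(BaseManager):
        pass

    dictionary = load_dictionary()  # placeholder for the real loading code

    # Expose the dict through a manager process; every lookup then becomes an
    # IPC round-trip to that process, which is presumably why the workers slowed down.
    DictionaryManager.register('get_dictionary',
                               callable=lambda: dictionary,
                               exposed=('get', '__getitem__', '__contains__'))

    manager = DictionaryManager()
    manager.start()
    shared_dictionary = manager.get_dictionary()  # a proxy that can be passed to workers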
The question
What other alternatives are there to speed things up? Things I have thought about include:
- MongoDB/CouchDB/memcached should handle concurrent access well, but I'm worried about throughput. zeromq was also suggested in a comment to my previous question, but I haven't had a chance to look into it.
- in-memory sqlite databases and database connections cannot be shared across processes, so each worker would need its own connection to an on-disk database. This means a lot of I/O at first and high memory usage as each worker's cache grows (see the sketch after this list).
- memory mapping
- using threads instead of processes
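To make the sqlite option concrete, each worker could open its own connection in a Pool initializer, along these lines (sketch only; the dictionary table schema and lookup_synonyms are hypothetical):

    import sqlite3
    from multiprocessing import Pool

    _conn = None  # one connection per worker process

    def init_worker(db_path):
        # Runs once in every worker when the pool starts.
        global _conn
        _conn = sqlite3.connect(db_path)

    def lookup_synonyms(phrase):
        # Hypothetical schema: dictionary(phrase TEXT PRIMARY KEY, synonyms TEXT)
        row = _conn.execute('SELECT synonyms FROM dictionary WHERE phrase = ?',
                            (phrase,)).fetchone()
        return row[0].split(',') if row else []

    pool = Pool(processes=8, initializer=init_worker, initargs=('dictionary.sqlite',))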
This SO question also suggested that many real-world problems that look like they need only read-only access to a dict may still trigger copy-on-write page copies after fork(), because CPython's reference counting writes to every object the workers touch, so it may be impossible to completely avoid making copies of large objects.
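For completeness, the fork-based pattern that discussion refers to is loading everything at module level before the pool is created, so that workers inherit the objects instead of receiving pickled copies (sketch only; the loaders are placeholders, and reference counting may still force pages to be copied):

    from joblib import Parallel, delayed

    # Loaded once in the parent process; on Linux, forked workers inherit these
    # without an explicit copy, at least until copy-on-write kicks in.
    DATA = load_documents()         # placeholder
    DICTIONARY = load_dictionary()  # placeholder

    def run_experiment(seed):
        # Only the small seed argument is pickled and sent to the worker;
        # DATA and DICTIONARY are reached through the inherited module globals.
        pass

    Parallel(n_jobs=8)(delayed(run_experiment)(seed) for seed in range(25))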