
This is a follow-up to my previous question. As suggested by Tim Peters, using a Manager may not necessarily be the best approach. Unfortunately I've got too much scaffolding code to post an SSCCE, so instead I'll try to provide a detailed explanation of my problem. Please feel free to browse the entire codebase on Github, but it's a bit of a mess right now.

Background

I am doing research in Natural Language Processing and I'd like to do (something like) dictionary-based smoothing for document classification. The idea is to train a classifier to associate words and phrases with the correct answer. For example, documents containing the word *socialist* are likely to be about politics, and those containing the phrase *lava temperature* are likely to be about geology. The system is trained by looking at a small number of pre-labelled examples. Because language is so varied, a classifier will never "know about" all possible phrases that it might encounter in production.

This is where the dictionary comes in. Suppose we had a cheap and easy way of getting synonyms for almost any phrase out there (I won't cite myself because it's in poor taste). When the poor classifier is faced with a phrase it doesn't know about, we could look it up in said dictionary and tell the classifier "Look, you do not know about communism, but it's kinda like socialist, and you know about that!". If the dictionary is reasonable, the classifier will generally perform better.

Pseudocode

data = Load training and testing documents (300MB on disk)
dictionary = Load dictionary (200MB - 2GB on disk) and place into a `dict` for fast look-ups
Repeat 25 times:
    do_work(data, dictionary)

def do_work(data, dictionary):
    X = Select a random sample of data
    Train a classifier on X
    Y = Select a random sample of data
    Using dictionary, classify all documents in Y
    Write results to disk
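
In real code the loop is driven roughly like this (a minimal sketch; `load_data`, `load_dictionary`, `random_sample`, `train`, `classify` and `write_results` are placeholders standing in for my scaffolding):

```python
from joblib import Parallel, delayed

def do_work(run_id, data, dictionary):
    # Train on one random sample, classify another, write results.
    X = random_sample(data)
    classifier = train(X)
    Y = random_sample(data)
    write_results(run_id, classify(classifier, Y, dictionary))

data = load_data()               # ~300MB on disk, several GB parsed
dictionary = load_dictionary()   # 200MB-2GB on disk

# One process per core; each task is currently handed (a copy of)
# data and dictionary, which is where the memory goes.
Parallel(n_jobs=-1)(delayed(do_work)(i, data, dictionary)
                    for i in range(25))
```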

The problem

The loop above is a perfect candidate for parallelisation. I have been using a Python 2.7 multiprocessing.Pool (through joblib.Parallel, because it's easy and provides very useful tracebacks when things go south). All worker processes need read-only access to the dictionary and the document collection. There is no need for the workers to communicate with one another or with the parent process: all they do is spawn, do some magic, write a file and die.

The dictionary needs to support fast random access. I do not know which documents the sample Y will contain, so I cannot easily prune the dictionary and pass just the part that each worker needs. The dictionary is queried very often: typical hit counts per run are in the millions. Currently my code is memory-bound, as (I believe) a copy of the document collection and of the dictionary is made for each worker process. Once parsed, the data and the dictionary typically use up several GB of RAM. I've tried using multiprocessing.managers.BaseManager to avoid copying the large objects, but that slowed the workers down.
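
For reference, the manager-based attempt looked roughly like this (a sketch, not my exact code). It avoids the per-worker copies, but every lookup becomes a pickled request/response round-trip to the manager process, which is presumably why it was slow:

```python
from multiprocessing.managers import BaseManager

class DictManager(BaseManager):
    pass

# Serve the dictionary from a single manager process; workers receive
# a proxy object, and every lookup through it is an IPC round-trip.
DictManager.register('get_dictionary', callable=lambda: dictionary,
                     exposed=('__getitem__', '__contains__', 'get'))

manager = DictManager()
manager.start()
shared_dictionary = manager.get_dictionary()  # proxy, cheap to pass to workers
```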

The question

What other alternatives are there to speed things up? Things I have thought about include:

  • MongoDB/CouchDB/memcached should handle concurrent access well, but I'm worried about throughput. zeromq was also suggested in a comment on my previous question; I haven't had a chance to look into it yet.
  • in-memory sqlite databases (and database connections in general) cannot be shared across processes, so each worker would need its own connection to an on-disk database. That means a lot of I/O at first, and high memory usage as each worker's cache grows (see the sketch after this list).
  • memory mapping
  • using threads instead of processes
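
For illustration, the sqlite option would look roughly like this (a sketch; the `phrases` table layout and the helper names are hypothetical):

```python
import sqlite3

def lookup_synonyms(conn, phrase):
    # Hypothetical schema: phrases(phrase TEXT PRIMARY KEY, synonyms TEXT)
    row = conn.execute('SELECT synonyms FROM phrases WHERE phrase = ?',
                       (phrase,)).fetchone()
    return row[0].split(',') if row else []

def worker(sample):
    # sqlite connections cannot be shared across processes, so each
    # worker opens its own connection to the on-disk database.
    conn = sqlite3.connect('dictionary.db')
    for phrase in sample:
        synonyms = lookup_synonyms(conn, phrase)
        # ... use synonyms during classification ...
    conn.close()
```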

This SO question also suggests that many real-world problems that look like they need only read-only access to a dict may still trigger fork()'s copy-on-write: CPython's reference counting writes to every object that is merely read, dirtying its memory pages, so it may be impossible to completely avoid making copies of large objects.
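
The simplest variant of that idea (also suggested in the comments below) seems to be making the big objects module-level globals, so that fork()ed workers inherit them instead of receiving pickled copies. A minimal sketch, reusing the placeholder helpers from above:

```python
import multiprocessing

# Module-level globals: on Linux, fork()ed workers inherit these pages
# and only copy a page when it is written to. CPython's reference
# counting writes to object headers, so some copying will still occur.
data = load_data()
dictionary = load_dictionary()

def do_work(run_id):
    # Take no large arguments; read the inherited globals directly.
    X = random_sample(data)
    classifier = train(X)
    Y = random_sample(data)
    write_results(run_id, classify(classifier, Y, dictionary))

if __name__ == '__main__':
    pool = multiprocessing.Pool()  # defaults to one process per core
    pool.map(do_work, range(25))
    pool.close()
    pool.join()
```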

– mbatchkarov
  • The very latest joblib, 0.8, has a threading backend. **If** your code uses enough NumPy (or other C code that releases the GIL), that might be as fast as multiprocessing but with shared memory. – Fred Foo Jan 07 '14 at 13:25
  • Consider using the [`shelve`](http://docs.python.org/2/library/shelve.html#module-shelve) module. Its cached memory usage can be kept under control by periodically calling `Shelf.sync()`, which should be fast if you're not modifying its content. – martineau Jan 07 '14 at 17:24
  • Try the simplest thing first: what happens if you "simply" create `data` and `dictionary` at module level, and let worker processes inherit copies via `fork()`? The SO post you linked to warning about reference counts is quite relevant here, but there's absolutely no way to guess how _much_ that matters for _your_ data and access patterns without trying it. The author of that post was, generally speaking, too pessimistic. – Tim Peters Jan 08 '14 at 01:34
  • This is an interesting idea: do you have a paper to cite yet? And how does it compare to LDA/dimensionality reduction for solving the OOV problem? – Ben Allison Jan 08 '14 at 14:06

1 Answer


In the scenario you describe, you are likely to run into large performance issues due to the GIL when using multithreading. That is probably why you chose multiprocessing instead. Multiprocessing, on the other hand, uses separate processes, so data structures may get copied into each subprocess.

I hate to say it, but moving to a non-Python solution (e.g. in C++) might speed things up, because there you do not have the GIL problem. You could then use multithreading and would not have to copy anything. Reading from a large dictionary from several threads is not really an issue, so you would not have to synchronize anything (which is what the GIL would otherwise do for you, without any real need).
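
If you stay in Python, the closest approximation is the threading backend mentioned in the comments on the question: threads share one address space, so the dictionary is never copied, but this only pays off if the hot code (e.g. NumPy or other C extensions) releases the GIL. A sketch, reusing the names from the question:

```python
from joblib import Parallel, delayed

# Threads share `data` and `dictionary`: no copies, no pickling.
# Worthwhile only if the heavy lifting releases the GIL.
Parallel(n_jobs=-1, backend='threading')(
    delayed(do_work)(i, data, dictionary) for i in range(25))
```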

– Alfe