
Sorry if this question seems overly noob to you. I've taken programming courses but never one on computer architecture. I had to pretty much learn from Wiki/SO/Google.

I have a dict called LUT, and I need to parallelize its lookup (READ-ONLY). I have a list of items that I am scattering across multiple threads/processes, and each thread/process then looks up LUT[item] for every item in its respective chopped-up sublist.
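To make the setup concrete, here is a toy version of what I mean (the names `LUT`, `items`, and `chunk` are just illustrative; the real dict and list are far larger):

```python
# Toy illustration -- the real LUT has millions of entries.
LUT = {"apple": 3, "banana": 7, "cherry": 1}
items = ["banana", "cherry", "apple", "banana"]

# The serial lookup I want to speed up:
results = [LUT[item] for item in items]

# "Scattering": chop the item list into one roughly equal slice per worker,
# so each thread/process handles only its own slice.
def chunk(seq, n_workers):
    size = (len(seq) + n_workers - 1) // n_workers
    return [seq[i:i + size] for i in range(0, len(seq), size)]
```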

I can only think of 7 options to achieve this:

1. threading module, all threads look up the same dict

2. multiprocessing module, all processes look up the same dict

3. multiprocessing module, each process looks up its own copy of the dict, e.g. if there are 2 processes, there are 2 copies of the dict

4. multiprocessing module, all processes look up a "shared proxy dict": Manager.dict

The following 3 options use Cython since I've heard it can be used to overcome Python's GIL.

5. Cython & C++'s STL unordered_map and multithreading, all threads look up the same unordered_map

6. Cython & C++'s STL unordered_map and multiprocessing, all processes look up the same unordered_map

7. Cython & C++'s STL unordered_map and multiprocessing, each process looks up its own copy of the unordered_map

I have already tried options 2, 3, & 4. Options 2 & 4 are around 100-1000x slower than the serial lookup. Option 3 works well, but its memory usage is too high, since every process holds its own copy of the dictionary.
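For reference, option 3 (the only one that was actually fast) looks roughly like this; the data, worker count, and the helper name `lookup_chunk` are all illustrative:

```python
from multiprocessing import Pool

# Illustrative stand-in data; the real LUT has millions of entries.
LUT = {i: i * i for i in range(1000)}

def lookup_chunk(chunk):
    # Each worker process ends up with its own copy of LUT (inherited from
    # the parent on fork, re-created when the module is re-imported under
    # spawn), which is why memory usage grows with the number of workers.
    return [LUT[item] for item in chunk]

if __name__ == "__main__":
    items = list(range(1000)) * 10
    n_workers = 4
    size = len(items) // n_workers + 1
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    pool = Pool(processes=n_workers)
    partial_results = pool.map(lookup_chunk, chunks)
    pool.close()
    pool.join()
    results = [r for part in partial_results for r in part]
```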

Options 5, 6, & 7 use Cython and its ability to wrap C++'s STL unordered_map, which is the C++ equivalent of Python's dict. Option 5 should technically sidestep Python's GIL, since the C++ lookups can run with the GIL released, but I am wondering whether multithreading can really help with something that is CPU-bound. What is my best bet here?
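For concreteness, this is the kind of thing I imagine for option 5: an untested Cython sketch assuming a str->int mapping, Cython's libcpp wrappers, and an OpenMP-enabled build. All names here are made up.

```cython
# lut_lookup.pyx -- hypothetical sketch of option 5
# distutils: language = c++

from libcpp.unordered_map cimport unordered_map
from libcpp.string cimport string
from libcpp.vector cimport vector
from cython.parallel cimport prange

# Module-level C++ map: no Python objects inside, so lookups need no GIL.
cdef unordered_map[string, long] LUT

def build_lut(py_dict):
    """Copy the Python dict into the C++ unordered_map once, under the GIL."""
    for k, v in py_dict.items():
        LUT[k.encode('utf8')] = v

def lookup_all(keys):
    """Look up every key; the hot loop runs with the GIL released."""
    cdef vector[string] ckeys
    cdef vector[long] out
    cdef Py_ssize_t i, n
    for k in keys:
        ckeys.push_back(k.encode('utf8'))
    n = ckeys.size()
    out.resize(n)
    # prange + nogil lets the lookups use multiple cores, but only if the
    # extension is compiled and linked with OpenMP (e.g. -fopenmp).
    # Assumes every key is present: operator[] would insert missing keys,
    # which would not be thread-safe.
    for i in prange(n, nogil=True, num_threads=4):
        out[i] = LUT[ckeys[i]]
    return [out[i] for i in range(n)]
```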

  • this makes no sense ... dicts have a lookup of O(1) ... why would you parallelize it ... – Joran Beasley Feb 16 '14 at 20:54
  • I want to concurrently look up one dict. For example, I have a list of 10,000 items. Instead of serially looking up each item one at a time, I can parallelize it and look it up 2 at a time. – richizy Feb 16 '14 at 20:55
  • Oh I see ... hmmm I would recommend sending copies of the dictionary so that each thread has its own copy ... shared memory access tends to be pricey and almost certainly negates any speed gain unless your list is much, much larger (maybe even then) ... this problem may be better suited to something like Go – Joran Beasley Feb 16 '14 at 20:56
  • If your dict being looked up is big, you could try using a NoSQL database like MongoDB. – Maciej Gol Feb 16 '14 at 21:00
  • shared memory is expensive and slow ... that's all there is to it ... one option may be to store your dict in a postgres database ... that each thread creates a connection to (pretty sure postgres has pretty good concurrency support) – Joran Beasley Feb 16 '14 at 21:00
  • @kroolik that may be easier than my suggestion of postgres ... but really probably any database program is roughly equivalent in terms of success for this particular task (I think they all support concurrent reading) – Joran Beasley Feb 16 '14 at 21:01
  • @JoranBeasley Thanks for your input, but can a thread utilize more than one core? [This post](http://stackoverflow.com/questions/4496680/python-threads-all-executing-on-a-single-core) says that all threads execute on one logical core because of Python's GIL. Hence I should use `unordered_map` to avoid the GIL? – richizy Feb 16 '14 at 21:05
  • @JoranBeasley, more or less, yeah. The part where MongoDB might be better for this use case than postgres is that it's schemaless, so you can put literally any JSON in a single collection, and its native data format is JSON itself. On the other hand, PostgreSQL has something similar called [hstore](http://www.postgresql.org/docs/9.1/static/hstore.html) used for key/value storage. – Maciej Gol Feb 16 '14 at 21:05
  • I suppose I should look into SQL/NoSQL; I am just afraid lookup will now be IO bound – richizy Feb 16 '14 at 21:07
  • Alternatively you can use [memcached](http://memcached.org/) or [redis](http://redis.io/). Both are in-memory key/value storages. – Maciej Gol Feb 16 '14 at 21:17
  • MultiThreading (in Python) won't help with something that is CPU-bound, only IO-bound. MultiProcessing sounds like overkill for a dictionary lookup. You know [premature optimization is the root of all evil](http://c2.com/cgi/wiki?PrematureOptimization), right? – martineau Feb 16 '14 at 22:04
  • Thanks everyone, I've learned immensely – richizy Feb 16 '14 at 23:07
  • If you only look up a value once and you do not need pairs of values, you can use `multiprocessing.Pool.map`. Could you explain more of how you need the values? – User Feb 17 '14 at 08:13
  • @User, the dictionary is actually an in-memory inverse-index `(word->doc-number)` i.e. if I have a query sentence, I can lookup all documents that contain any word within that sentence. Since I have 1000000s of sentences to query, I would like to split it up among my cores. – richizy Feb 18 '14 at 00:30
  • I plan to use `multiprocessing` as per @JoranBeasley, and to circumvent the memory issue, I'll just offload the actual documents into an HDF5/PyTables `table`, which is out-of-core storage, supports numeric data (since my documents are tfidf'ed), allows indexing, and allows concurrent READ-ONLY threads. I need an embedded datastore, hence the aforementioned SQL/NoSQL won't be much of a help. – richizy Feb 18 '14 at 00:32
  • 5 should be your best bet. Lookup is CPU-bound... per core. If you have multiple cores, multithreading will work faster than a single thread. (Assuming IO and synchronization at a higher level isn't a problem.) – AndrewS May 16 '14 at 18:38

0 Answers