
I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in-memory database (tens of GB) - the haystack.

The work is divided into tasks, where each task is to find each of a series of needles in the haystack, and each task is logically independent of the other tasks.

(This is already distributed across multiple machines where each machine has its own copy of the haystack.)

There are many ways this could be parallelized on individual machines.

We could have one search process per CPU core, all sharing memory. Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.

3 possible architectures:

  1. A process loads the haystack into POSIX shared memory.

    Subsequent processes use the shared memory segment instead (like a cache); see the sketch after this list.

  2. A process loads the haystack into memory and then forks.

    Each process uses the same memory because of copy-on-write semantics.

  3. A process loads the haystack into memory and spawns multiple search threads.
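
For reference, option 1 might look something like the sketch below. The Record layout, the segment name "/haystack" and the function names are made up for illustration, the haystack is assumed to be a flat array of fixed-size records, and error handling is omitted:

    #include <fcntl.h>      // O_* constants
    #include <sys/mman.h>   // shm_open, mmap
    #include <sys/stat.h>   // mode constants
    #include <unistd.h>     // ftruncate, close
    #include <cstddef>
    #include <cstdint>

    struct Record { uint64_t key; uint64_t payload; };

    // Loader: create the segment, size it, map it writable and fill it once.
    Record* create_haystack(std::size_t count) {
        int fd = shm_open("/haystack", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, off_t(count * sizeof(Record)));
        void* p = mmap(nullptr, count * sizeof(Record),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                        // the mapping stays valid after close
        return static_cast<Record*>(p);   // fill the records here, then never write again
    }

    // Searcher: map the same segment read-only and treat it like a cache.
    const Record* open_haystack(std::size_t count) {
        int fd = shm_open("/haystack", O_RDONLY, 0);
        void* p = mmap(nullptr, count * sizeof(Record),
                       PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        return static_cast<const Record*>(p);
    }

(On older glibc this needs linking with -lrt. Option 2 needs none of this, because a forked child simply inherits the parent's mapping, and option 3 just shares the owning process's address space.)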

The question is: is one method likely to be better than the others, and why? Or rather, what are the trade-offs?

(For argument's sake assume performance trumps implementation complexity).

Implementing two or three of these and measuring them is possible, of course, but it is hard work. Are there any reasons why one might be definitively better?

  • Data in the haystack is immutable.
  • The processes are running on Linux, so processes are not significantly more expensive than threads.
  • The haystack spans many GB, so CPU caches are not likely to help.
  • The search process is essentially a binary search (actually equal_range with a touch of interpolation; see the sketch after this list).
  • Because the tasks are logically independent, there is no benefit from inter-thread communication being cheaper than inter-process communication (as per, for example, https://stackoverflow.com/a/18114475/1569204).
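
For concreteness, the per-needle lookup is roughly the sketch below (the Record layout and key type are placeholders, and the real code adds the interpolation mentioned above):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Record { uint64_t key; uint64_t payload; };

    // Comparator callable in both argument orders, as std::equal_range requires
    // when the needle type differs from the element type.
    struct KeyLess {
        bool operator()(const Record& r, uint64_t k) const { return r.key < k; }
        bool operator()(uint64_t k, const Record& r) const { return k < r.key; }
    };

    // Returns the half-open range [first, last) spanning every copy of needle
    // in a haystack sorted by key.
    std::pair<const Record*, const Record*>
    find_copies(const std::vector<Record>& haystack, uint64_t needle) {
        return std::equal_range(haystack.data(),
                                haystack.data() + haystack.size(),
                                needle, KeyLess{});
    }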

I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?


Background research

The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.

Related and interesting but different questions are:

An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf

It suggests there can be a small difference in the speed of thread vs process context switches. I am assuming that, except for a monitoring thread/process, the others are never switched out.

Bruce Adams

2 Answers


General advice: Be able to measure improvements! Without that, you may tweak all you like based on advice off the internet and still not get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself), but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.

Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are therefore CPU and RAM bandwidth. So, in order to work on the right part, find out which one is the limiting factor. Making sure that the corresponding part is efficient ensures optimal performance for your searches.

Further, you say that you do binary searches. This basically means you do log(n) comparisons, where each comparison requires loading a certain element from the haystack. That load will probably miss every cache level, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you manage to trigger the cache loads for the needles first and only then perform the comparisons, you can reduce the time where either the CPU or the RAM sits idle waiting for new operations to perform. This is obviously (like the others) a parameter you need to tweak for the system it runs on.
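
For illustration, a sketch of that idea, assuming the haystack is a sorted array of plain keys: advance a batch of binary searches in lockstep and issue a software prefetch (the GCC/Clang __builtin_prefetch) for every pending probe before doing any of the comparisons, so the cache misses overlap. The batch size and the function name are assumptions you would tune and rename.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // For each needle, computes the index of the first element >= needle
    // (lower_bound semantics) in the sorted key array, advancing all searches
    // in lockstep: prefetch every pending probe, then do the comparisons.
    std::vector<std::size_t>
    batched_lower_bound(const std::vector<uint64_t>& keys,
                        const std::vector<uint64_t>& needles) {
        const std::size_t B = needles.size();            // batch size (tune this)
        std::vector<std::size_t> lo(B, 0), len(B, keys.size());

        bool any_left = !needles.empty();
        while (any_left) {
            // Phase 1: issue a prefetch for every pending probe.
            for (std::size_t i = 0; i < B; ++i)
                if (len[i] > 0)
                    __builtin_prefetch(&keys[lo[i] + len[i] / 2]);

            // Phase 2: by now the probed elements are (hopefully) in cache.
            any_left = false;
            for (std::size_t i = 0; i < B; ++i) {
                if (len[i] == 0) continue;
                std::size_t half = len[i] / 2;
                std::size_t mid  = lo[i] + half;
                if (keys[mid] < needles[i]) {            // answer is to the right
                    lo[i]   = mid + 1;
                    len[i] -= half + 1;
                } else {                                 // answer is at or left of mid
                    len[i]  = half;
                }
                any_left = any_left || len[i] > 0;
            }
        }
        return lo;  // lo[i] is the lower_bound index for needles[i]
    }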

Even further, reconsider binary searching. Binary search performs reliably, with a good upper bound, on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This basically moves the work from the RAM bus to the CPU, so it again depends on which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when fewer than a certain number of elements are left to consider.
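
A sketch of the "educated guess, then binary search" idea, assuming roughly uniformly distributed numeric keys; the 64-element cutoff is only a tunable guess:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Interpolation ("educated guess") search that falls back to a plain binary
    // search once the remaining range is small. Returns the lower_bound index.
    std::size_t guess_then_bisect(const std::vector<uint64_t>& keys,
                                  uint64_t needle) {
        std::size_t lo = 0, hi = keys.size();          // the answer lies in [lo, hi]
        while (hi - lo > 64) {                         // 64 is just a tunable cutoff
            uint64_t first = keys[lo], last = keys[hi - 1];
            if (needle <= first) break;                // keys[lo] is already >= needle
            if (needle > last) { lo = hi; break; }     // nothing in range is >= needle
            // Guess the position as if keys were spread evenly over [first, last].
            double frac     = double(needle - first) / double(last - first);
            std::size_t mid = lo + std::size_t(frac * double(hi - lo - 1));
            if (keys[mid] < needle) lo = mid + 1;      // answer is to the right of mid
            else                    hi = mid;          // answer is at or left of mid
        }
        // Finish with an ordinary binary search on the small remaining range.
        return std::size_t(std::lower_bound(keys.begin() + lo, keys.begin() + hi,
                                            needle) - keys.begin());
    }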

Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.
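
A sketch of that dispatch step, assuming at least one node, a key space split into contiguous sorted partitions, and a made-up table node_upper_bounds whose entry i holds the largest key stored on node i:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Returns the node responsible for needle, given one upper bound per node.
    std::size_t owning_node(const std::vector<uint64_t>& node_upper_bounds,
                            uint64_t needle) {
        auto it = std::lower_bound(node_upper_bounds.begin(),
                                   node_upper_bounds.end(), needle);
        if (it == node_upper_bounds.end())             // above every bound:
            return node_upper_bounds.size() - 1;       // fall back to the last node
        return std::size_t(it - node_upper_bounds.begin());
    }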

Ulrich Eckhardt
  • All sound and relevant advice but it does not answer the question "is there any reason to choose between a threaded architecture vs a shared memory architecture". Perhaps that is a red herring as the question is whether we are limited by cpu or memory bandwidth. https://stackoverflow.com/questions/2952277/when-is-a-program-limited-by-the-memory-bandwidth hints at how to determine that. – Bruce Adams Oct 28 '19 at 17:10

The modern approach is to use threads and a single process.
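
In the C++ setting of the question that can be as small as the sketch below, where Task and run_task are stand-ins for the real per-task search and the static round-robin split assumes the tasks cost roughly the same:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    struct Task { /* a batch of needles to look up */ };

    // One process, one worker thread per core; the immutable haystack lives in
    // this process's address space, so the workers need no locking to read it.
    void run_all(const std::vector<Task>& tasks,
                 const std::function<void(const Task&)>& run_task) {
        const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w) {
            // Static round-robin split: worker w handles tasks w, w+workers, ...
            pool.emplace_back([&, w] {
                for (std::size_t i = w; i < tasks.size(); i += workers)
                    run_task(tasks[i]);
            });
        }
        for (auto& t : pool) t.join();
    }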

Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and on how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java), you are pretty much always better off using it.

The main advantage of using multiple processes, as far as I can see, is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g. forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, achieving thread-safety by not having threads at all is not much of an argument in its favour.

It might also be slightly easier to add processes than to add threads: you would have to write some code to change the number of processing threads online (or use a framework or application server).

But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.

If you do need multi-threaded access to common writable memory, then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that, and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes, or maybe looking for a third-party library; the Java people will be miles ahead by then.

You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.

rghome
    I don't think the multiple-process approach is dead at all. It scales across multiple machines where threads cannot (unless you have a very clever runtime) for example. In my case the cross machine machinery is already in place. I'm interested in performance variations on a single node. Maintenance costs are relevant too but secondary. – Bruce Adams Oct 29 '19 at 14:25
  • Yes you need different processes on different machines. But then you can't use shared memory, so that is not the case we are talking about. – rghome Oct 29 '19 at 15:02
  • Actually you can. You can share a file over the network and memory map it locally on each machine. This is not recommended unless you keep it read-only of course. – Bruce Adams Oct 29 '19 at 16:21