
I have a PyQt5 application which runs perfectly on my development machine (Core i7 Windows 7), but has performance issues on my target platform (Linux Embedded ARM). I've been researching Python concurrency in further detail, prior to 'optimising' my current code (i.e. ensuring all UI code is in the MainThread, with all logic code in separate threads). I've learnt that the GIL largely prevents the CPython interpreter from realising true concurrency.

My question: would I be better off using IronPython or Cython as the interpreter, or sending all the logic to an external non-Python function which can make use of multiple cores, and leave the PyQt application to simply update the UI? If the latter, which language would be well suited to high-speed, concurrent calculation?

jars121
  • Does your target platform have more than one core? – wallyk Oct 09 '17 at 21:07
  • The prototype hardware has 4 cores, which we can assume is representative of the production target platform as well. – jars121 Oct 09 '17 at 21:07
  • Sometimes on Linux it can be better to restrict a Python application to a single core to make it faster. You can try to start the application with https://linux.die.net/man/1/taskset to see if this helps. – Michael Butscher Oct 09 '17 at 21:16
  • Another solution is to use Python's `multiprocessing`. – zch Oct 09 '17 at 22:08
  • I've been reading about the concurrent.futures and multiprocessing.Pool options as well, which look promising. – jars121 Oct 09 '17 at 22:39

2 Answers


If the latter, which language would be well suited to high-speed, concurrent calculation?

You've written a lot about your system and yet not enough about what it actually does: what kind of "calculations" are you doing? If you're doing anything heavily computational, it's very likely someone has already worked very hard to produce a hardware-optimized library for those kinds of calculations, e.g. BLAS via scipy/numpy (see Arm's own website). You want to push as much work as possible out of your own Python code and into their hands. The language you use to call these libraries is much less important; Python is already great at this kind of "gluing" work.

Note that even using built-in Python functions, such as sum(value for value in some_iter) instead of summing in a Python for loop, pushes computation out of slow interpretation and into highly optimized C code.
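To make the last point concrete, here is a minimal sketch comparing the two approaches to the same sum (the numbers and variable names are just for illustration):

```python
values = range(1_000_000)

# Manual loop: every iteration is interpreted bytecode-by-bytecode.
total_loop = 0
for v in values:
    total_loop += v

# Built-in sum(): the loop runs inside CPython's C implementation.
total_builtin = sum(values)

assert total_loop == total_builtin  # same result, very different speed
```

On a slow embedded core, the difference between interpreted loops and C-level loops is often far larger than any gain you could get from parallelism alone.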

Otherwise, without profiling your actual code, it's hard to say what would be best. After formulating your calculations so that optimized libraries can do their work efficiently (e.g. by properly vectorizing them), you can use Python's multiprocessing to divide whatever Python logic is causing a bottleneck from that which isn't (see this answer on why multiprocessing is often better than threading). I'd wager this would be much more beneficial than just swapping out CPython for another implementation.

Only once you've delegated as much computation to external libraries as possible and parallelized as much as possible using multiprocessing would I then start rewriting the remaining computation-heavy code in Cython, which could be considered a type of low-level optimization on top of the aforementioned architectural improvements.

errantlinguist
  • Very good point, thank you. This is a data acquisition and display system; it checks the status of a number of sensors, records their value (for plotting), and updates their respective GUI element accordingly. The data acquisition component is already in an external Python file, which is run on a separate thread from the MainThread. The MainThread still contains non-GUI logic, which is largely the focus of this question. I could externalise that logic as well, but my research suggests that threading.Thread() doesn't provide the concurrency one would think. – jars121 Oct 09 '17 at 23:05
  • Python's multi*threading* isn't about performance because, as you already stated, the GIL essentially cripples it; it's mainly used to maintain responsiveness in GUI components. If your Python code still has a lot of non-GUI logic, I'd try putting that in separate *processes* which then asynchronously call back to the relevant GUI components. But don't create tons of tiny processes, or at least not until you profile the changes from using a few. – errantlinguist Oct 09 '17 at 23:15
  • Ok great, thanks, you've validated my interpretation of threading vs. processes. I imagine then that I'll still use multithreading to ensure GUI responsiveness, but will then incorporate multiprocessing to actually execute the work in the background. Once the work has been executed, I then emit a custom signal, which causes a GUI event on the MainThread. – jars121 Oct 09 '17 at 23:18
  • Yes, just have each process implement a callback function for your relevant widget/component, which lives in its own GUI thread or threads; it's a pretty common pattern. The multi-threading-GUI-vs.-multiprocessing thing is somewhere on SO in more detail but I'm too tired and lazy to find it. – errantlinguist Oct 09 '17 at 23:35
  • But don't overdo the number of processes. – errantlinguist Oct 09 '17 at 23:41
  • As each additional process comes with overhead, that makes sense. Thanks again! – jars121 Oct 09 '17 at 23:49
  • @jars121 did this info help you in the end? If so, I'd be greatly appreciative if you [accepted it](https://stackoverflow.com/help/someone-answers). – errantlinguist Oct 12 '17 at 07:45
  • I haven't had a chance to put what I've learnt into practice, but I'm confident it's the answer I was looking for. Thank you! – jars121 Oct 15 '17 at 09:22
  • @jars121 happy to oblige. "Architectural optimization" is one of the interests I have which isn't warranted often enough... – errantlinguist Oct 15 '17 at 11:37
  • I completely understand! I've not developed for an embedded device before, so am only now learning the importance of optimising every single facet of the design to compensate for limited hardware capability. – jars121 Oct 16 '17 at 01:27

Echoing @errantlinguist: please be aware that parallel performance is highly application-dependent.

To maintain GUI responsiveness, yes, I would just use a separate "worker" thread to keep the main thread available to handle GUI events.
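A minimal, GUI-free sketch of that worker-thread pattern using only the standard library (in a real PyQt5 app you would use QThread and signals instead, but the structure is the same; `worker` and `task_data` are hypothetical names):

```python
import threading
import queue

results = queue.Queue()

def worker(task_data):
    # Long-running work happens here, off the main thread.
    outcome = sum(task_data)   # stand-in for real acquisition/logic
    results.put(outcome)       # hand the result back thread-safely

t = threading.Thread(target=worker, args=([1, 2, 3],), daemon=True)
t.start()
# ... the main thread stays free to process GUI events here ...
t.join()
value = results.get()
print(value)  # 6
```

The key point is that the main thread only ever blocks on short, thread-safe hand-offs, never on the work itself.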

To do something "embarrassingly parallel", like a Monte Carlo computation, where you have very many completely independent tasks with minimal communication between them, I might try multiprocessing.
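For example, a Monte Carlo estimate of pi splits naturally into independent chunks, one per process, with the only communication being the final hit counts (a sketch; the function names are illustrative):

```python
import random
from multiprocessing import Pool

def estimate_hits(args):
    seed, samples = args
    rng = random.Random(seed)   # independent, seeded RNG per process
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def monte_carlo_pi(samples_per_worker=100_000, workers=4):
    tasks = [(seed, samples_per_worker) for seed in range(workers)]
    with Pool(processes=workers) as pool:
        total_hits = sum(pool.map(estimate_hits, tasks))
    return 4.0 * total_hits / (samples_per_worker * workers)

if __name__ == "__main__":
    print(monte_carlo_pi())  # roughly 3.14
```

Because each task is independent, adding cores scales the throughput almost linearly, which is exactly the regime where multiprocessing shines.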

If I were doing something like very large matrix operations, I would do it multithreaded. Anaconda will automatically parallelize some numpy operations via MKL on Intel processors (but this will not help you on ARM). You could look at something like Numba to help with this if you stay in Python. If you are unhappy with performance, you may want to try implementing in C++: if you use almost all vectorized numpy operations, you should not see a big difference from C++, but as Python loops etc. start to creep in, you will probably begin to see big differences in performance (beyond the maximum 4x you will gain by parallelizing your Python code over 4 cores). If you switch to C++ for matrix operations, I highly recommend the Eigen library. It's very fast and easy to understand at a high level.
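To illustrate the vectorized-vs.-looped gap (assuming numpy is available; the sizes here are kept small just for demonstration):

```python
import numpy as np

n = 100
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Vectorized: one call into the underlying BLAS, which may itself be
# multithreaded depending on how numpy was built (MKL, OpenBLAS, ...).
c_fast = a @ b

# Per-element Python loop: the same result, but every entry goes
# through the interpreter -- much slower, and it only gets worse
# as n grows.
c_slow = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        c_slow[i, j] = np.dot(a[i, :], b[:, j])

assert np.allclose(c_fast, c_slow)
```

The per-element loop above already mixes in numpy's `dot`; a pure-Python inner sum would be slower still, which is the "loops start to creep in" effect described above.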

Please be aware that when you use multithreading, you are usually in a shared-memory context, which eliminates a lot of the expensive I/O you will encounter in multiprocessing, but it also introduces classes of bugs you are not used to encountering in serial programs (when two threads access the same resources). In multiprocessing, memory is usually separate, except for explicitly defined communication between the processes. In that sense, I find that multiprocessing code is typically easier to understand and debug.
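The classic shared-memory bug is the unsynchronized read-modify-write; a minimal sketch of guarding it with a lock (the counter here is just a toy stand-in for any shared resource):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        # Without the lock, the read-modify-write below could interleave
        # between threads and silently lose updates (a data race).
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- deterministic only because of the lock
```

With multiprocessing, each worker would have its own `counter` and this whole class of bug disappears, at the cost of having to pass results back explicitly.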

Also, there are frameworks out there for handling complex computational graphs with many steps, which may mix both multithreading and multiprocessing (try dask).

Good luck!

tvt173