
My problem essentially boils down to "How do I get this bit of threaded Python code to run on my GPU instead of my CPU?"

I'm working on a program similar to a Travelling Salesman problem, where I recursively check each possible move (with optimisations, of course). The tricky part is the retention of global variables: Python's native threading handles this very well, and the algorithm I'm using relies entirely on global variables (gross, I know). At the risk of going into too much detail: many of my threads have to spawn separate threads of their own, up until a certain 'depth' (I've found around 3 works best), at which point the recursion is no longer parallelised and the function executes linearly instead.

The serial version worked decently well, and threading it brought a further performance improvement. However, it's still not good enough: if I can maintain the global variables, then in theory this program can be completely parallelised, and hence I think it could run very quickly on a GPU.

The code's messy at the moment, but here's the general idea, sketched in simplified Python:

import threading

x = 0                               # the global variable

maxDepth = 3                        # below this depth the recursion is threaded

def f(depth):                       # THE RECURSIVE f( n ) TEMPLATE

    global x
    # do stuff with x

    n = ...                         # this number will change each time

    if depth <= maxDepth:           # if we're still below the max depth,
                                    # then we'll thread the next round of recursion

        threads = []
        for i in range(n):
            t = threading.Thread(target=f,
                                 args=(depth + 1,))
                                    # obviously the arguments passed to each thread
                                    # will be a little different,
                                    # but that shouldn't be a problem
            t.start()
            threads.append(t)
        for t in threads:
            t.join()

    else:                           # if we're already past the max depth,
                                    # then we won't bother parallelising,
                                    # as the overheads would outweigh the benefits

        for i in range(n):
            f(depth + 1)            # THE SELF-RECURSIVE CALL

So my question is simple - can I (easily) translate from a threaded python program to a threaded python program that runs on my GPU while still maintaining use of global variables? I'm aware of the existence of Numba/NumbaPro, but they're very intimidating packages, and I'm not sure how well a program like mine would translate into that framework.

Bruno E
    I think the answer to your question is "no". GPUs are not at all like CPUs, and generally speaking you can't take a CPU-style program and run it on a GPU; you'd need to redesign the program from scratch to take advantage of the GPU's architecture. Your best bet (if you don't want to spend a lot of time learning GPU programming) would be to find a library that already does what you want on a GPU, and use it. – Jeremy Friesner Jul 23 '19 at 20:29

1 Answer


Q: can I (easily) translate from a threaded python program to a threaded python program that runs on my GPU while still maintaining use of global variables?

A: In principle, not at all; and certainly not easily.

Here is the WHY part:

a) the recursion, as sketched in the code above, is flawed. It does not control its own termination; it diverges into infinity, and all the per-pass allocated resources accumulate and wait for the recursion's deepest point-of-return, only after which they start to be released, as the recursion's consolidation phase "bubbles" back up to the initial caller.
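Just to make point a) concrete, here is a minimal sketch of the missing piece: an explicit base case, so each branch bottoms out and its resources can be released on the way back. The max_total_depth cut-off and the shrinking n are illustrative assumptions of mine, not part of the question's algorithm:

x = 0                                        # shared global, as in the question

def f(depth, n, max_total_depth=10):         # max_total_depth and the shrinking n
    global x                                 # are illustrative assumptions only
    # do stuff with x

    if depth >= max_total_depth or n <= 0:   # explicit base case: the recursion
        return                               # bottoms out and resources get released

    for i in range(n):
        f(depth + 1, n - 1, max_total_depth)

f(0, 3)                                      # example call: terminates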

b) python threads (as of 2019-Q3) are principally zero-concurrent: the GIL lock prevents any true concurrency, and Python's thread-"concurrent" processing is known to run worse than a pure-[SERIAL] process-flow, because of the new overheads added. The one exception is latency masking (hiding external or I/O-related request/response delays and transport latencies): separating slow tasks into multiple threads lets their waiting times overlap, and that concurrency of waiting times is the source of the improvement, compared to a pure-[SERIAL] ordering of each request, its waiting time and the post-processing of its response, after it has finally arrived (or not).
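A small, self-contained test (a sketch of mine, not from the original answer) makes point b) measurable on any machine: a CPU-bound function gains nothing from Python threads, while a sleep-based stand-in for I/O waiting does, because the waiting times overlap:

import threading, time

def cpu_bound(k=2_000_000):               # pure computation: holds the GIL
    s = 0
    for i in range(k):
        s += i * i
    return s

def io_bound(seconds=0.5):                # pure waiting: releases the GIL
    time.sleep(seconds)

def timed(label, fn, n=4, threaded=False):
    t0 = time.time()
    if threaded:
        workers = [threading.Thread(target=fn) for _ in range(n)]
        for w in workers: w.start()
        for w in workers: w.join()
    else:
        for _ in range(n):
            fn()
    print(f"{label:24s} {time.time() - t0:.2f} s")

timed("cpu-bound, serial",   cpu_bound)
timed("cpu-bound, threaded", cpu_bound, threaded=True)   # ~ same or slower: GIL
timed("io-bound,  serial",   io_bound)
timed("io-bound,  threaded", io_bound,  threaded=True)   # ~ n-times faster: latency masking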

c) a generic recursion is a form of pure-[SERIAL], many-levels-deep (sometimes very deep, yet never infinite) nesting, with a deferred use of values still to be computed at deeper levels, which only become available during the "back-propagation" of a terminal value all the way back through the recursion's consolidation phase. Specific cases aside, a generic algorithm that expresses itself through recursion can hardly be parallelised at all, let alone parallelised efficiently once the add-on overhead costs of doing so are accounted for (see the role of all add-on costs in Amdahl's Law). On top of that, the "density" of computing goes down as the recursive calls go deeper, while the resource needs grow all that way up. The rules of Computer Science are quite clear on this: play with a fully interactive, animated impact analysis of Amdahl's Law and try to move the p-slider, for the [PARALLEL]-fraction of the processing, anywhere below the unrealistic 100%-parallel code, as if it had zero instantiation add-on costs, zero transfer add-on costs and zero results-consolidation add-on costs, so as to at least "smell the smoke" of the real-life fire.
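For point c), a tiny calculator (again my sketch) for an overhead-aware reading of Amdahl's Law shows how quickly the achievable speed-up collapses once the [PARALLEL]-fraction p drops below 100 % and the add-on overheads stop being zero:

def speedup(p, n_workers, overhead=0.0):
    """Amdahl's Law speed-up for a parallel fraction p on n_workers,
    with 'overhead' given as a fraction of the original serial runtime
    spent on instantiation, transfers and results consolidation."""
    return 1.0 / ((1.0 - p) + p / n_workers + overhead)

for p in (1.00, 0.95, 0.80, 0.50):
    print(f"p = {p:4.2f}: "
          f"no overhead -> {speedup(p, 1024):7.1f}x, "
          f"10% overhead -> {speedup(p, 1024, overhead=0.10):5.1f}x")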

d) NUMA-ecosystems (Non-Uniform Memory Access systems, all the more so when the access goes through a proxy-(device-driver)-mediated path to a foreign language (be it CUDA C, OpenCL, an SoC/FPGA-based hybrid or any other), compiled into a static GPU-kernel code, transferred to the "inside" of the GPU card and left to execute there rather autonomously, hosted on remote, distributed hardware, of which GPUs are an example) can hardly be expected to have all the properties native to the call-initiating subsystem evenly available and working, least of all a two-way coordinated propagation of changes in state/value. Using global variables on the python side is also strongly discouraged (a few known cases excepted), even inside a homogeneous, python-only piece of software.
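To illustrate point d): there is no shared "global" between the Python host and a GPU kernel; any state has to be copied in and out explicitly. A minimal sketch, assuming Numba (mentioned in the question) plus a CUDA-capable GPU and driver are available:

import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):                        # the kernel only ever sees this array,
    i = cuda.grid(1)                     # never the host process's Python globals
    if i < arr.shape[0]:
        arr[i] += 1

x_host = np.zeros(1024, dtype=np.int32)  # "global" state lives on the host
x_dev  = cuda.to_device(x_host)          # explicit host -> device copy
add_one[4, 256](x_dev)                   # launch 4 blocks x 256 threads
x_host = x_dev.copy_to_host()            # explicit device -> host copy back
print(x_host[:5])                        # [1 1 1 1 1]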

e) python scripts (the programs the python interpreter interprets), least of all the threaded ones, do not run on a GPU. GPU devices also have units of code execution called threads, but these execute on principally different hardware (vendor-specific, SM-based devices) that exhibits SIMD parallelism. Wherever autonomously operated threads each have different code to execute, or even the same code but a different (divergent, in GPU jargon) code-execution path, the SIMD device wastes time scheduling the SIMD-executable threads and letting all the incoherent SIMD-threads wait for their (divergent) turn, until their per-thread-different instruction becomes "free" to execute across the whole SIMD unit, i.e. until no other SIMD-thread in the block has a different instruction pending. One can guess the immense performance deterioration when divergent threads have to wait, on a ~ <2 GHz SIMD device, for some future moment at which the 32-threads-(warp)-wide SIMD unit becomes free from any other work and can first "privately" serve a single (divergent) SIMD-thread. Yes, there are python tools that let a python program prepare another, "embedded" block of code, which then gets transformed, compiled, downloaded via the GPU-device queues onto the fabric and scheduled for remote execution on the GPU device; yet all these transformations are complex enough that one should first understand, at least at a high level, how such a class of algorithms gets mapped onto a foreign computing device, executed there in the foreign processor's machine code, while also obeying all the known foreign-device-specific constraints on resources and processing-organisation policies.
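A hedged sketch of the divergence effect from point e), again in Numba's CUDA dialect (assuming a CUDA GPU is present): threads of the same 32-wide warp that take different branches get serialised, one branch after the other, instead of running concurrently:

import numpy as np
from numba import cuda

@cuda.jit
def divergent(out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        if i % 2 == 0:                   # even threads take this branch ...
            out[i] = i * 2
        else:                            # ... odd threads take this one;
            out[i] = i * 3               # within a single warp the two branches
                                         # execute one after the other, not in parallel

out = cuda.to_device(np.zeros(64, dtype=np.int32))
divergent[1, 64](out)                    # 2 warps of 32 threads each
print(out.copy_to_host()[:8])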

f) GPU-hosted kernel code is indeed extremely sensitive to resource use, the more so if recursive. The depth of recursion of any recursive algorithm (the one here diverges, as noted in item a)) may on its own make even a correctly formulated recursive code plainly impossible to run inside a GPU device, precisely because of the fixed and rather small amounts of high-performance on-chip memory resources (the SM cache sizes are very limited even on HPC-oriented GPU fabrics, compared with those on contemporary CPUs, not to mention the mainstream, primarily gaming-oriented, GPU cards).
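To see how tight those on-device budgets are on a particular card, one can query its limits. A sketch assuming Numba with a working CUDA driver; the uppercase attribute names follow the CUDA device-attribute naming that Numba exposes, so treat them as assumptions to verify locally:

from numba import cuda

dev = cuda.get_current_device()               # assumes a CUDA device and driver are present
print("device                  :", dev.name)
print("max threads per block   :", dev.MAX_THREADS_PER_BLOCK)
print("shared memory per block :", dev.MAX_SHARED_MEMORY_PER_BLOCK, "B")   # typically tens of kB
free_B, total_B = cuda.current_context().get_memory_info()
print("global memory free/total:", free_B, "/", total_B, "B")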

user3666197