Q: can I (easily) translate from a threaded python program to a threaded python program that runs on my GPU while still maintaining use of global variables?
A: Principally no, and the less so easily :
Here is the WHY part :
a) the recursion, formally sketched in the pseudo-code above, is flawed: it does not control its own termination, but rather diverges towards infinity. All the per-pass allocated resources accumulate and wait for the recursion's deepest point-of-a-return, and only from there do they start to get released, as the consolidation phase of the recursion "bubbles" back to the initial caller. A minimal contrast between a terminating and a diverging recursion is sketched right below.
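A minimal sketch ( not the OP's pseudo-code, just an illustration of the principle ) of the difference between a recursion that controls its own termination and one that diverges:

```python
def summing_recursion_ok( n ):
    """Terminates: every call moves towards the base-case n == 0."""
    if n == 0:                                  # base-case stops the descent
        return 0
    return n + summing_recursion_ok( n - 1 )    # consolidation "bubbles" back

def summing_recursion_flawed( n ):
    """Diverges: no base-case test, every call just allocates another frame
       that waits for a deepest point-of-a-return that never comes,
       until the interpreter finally raises a RecursionError."""
    return n + summing_recursion_flawed( n - 1 )
```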
b) python Threads ( as of 2019-Q3 ) are principally zero-concurrent for compute-bound work, due to the GIL-lock prevented concurrency, and python thread-based processing knowingly operates worse ( due to the new overheads added ) than a pure-[SERIAL] process-flow. The one exception is latency-masking ( hiding external or I/O-related request/response delays and transport latencies ): there, separating the slow tasks into multiple threads may pay off, because the overlapping of the waiting-times, not of any computing, is the source of the improvement, if compared to a pure-[SERIAL] ordering of the request, the waiting time and the post-processing of the response, after it has finally arrived ( or not ). A minimal latency-masking sketch follows right below.
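A minimal sketch, assuming a purely latency-bound task ( time.sleep() stands in here for an external I/O request/response delay ) -- the only class of work where python threads pay off, because the waiting-times overlap, whereas GIL-stepped compute-bound work in threads runs no faster than a pure-[SERIAL] run:

```python
import threading, time

def slow_io_task( i, results ):
    time.sleep( 1.0 )                   # a latency that threads can mask
    results[i] = i                      # ( overlap ), unlike CPU-bound work

results = [None] * 4
threads = [ threading.Thread( target = slow_io_task, args = ( i, results ) )
            for i in range( 4 ) ]

t0 = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
print( "4 x 1 [s] waits overlapped into ~",
       round( time.perf_counter() - t0, 2 ), "[s] of wall-clock time" )
```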
c) a generic recursion is a form of a pure-[SERIAL], many-levels-deep ( sometimes very deep, yet never infinite ) nesting, with a deferred use of a yet-to-be-calculated deeper-level value during the "back-propagation" of a terminal value, all the way back through the consolidation phase of the recursion. While specific exceptions exist, a generic algorithm that uses a form of recursion for self-expression can hardly get parallelised, the less efficiently parallelised, w.r.t. the add-on overhead costs of doing so ( see the role of all the add-on costs in the Amdahl's Law ). Plus, the "density" of computing goes way down during the recursive calls, while the resources needs grow all that way up. The rules of Computer Science are quite clear on this. Watch and play with this fully interactive, animated impact analysis and try to move the p-slider, for the [PARALLEL]-fraction of the processing, anywhere lower than the un-realistic 100% parallel-code, as if having zero-instantiation add-on costs, zero-transfer add-on costs and zero-results-consolidation add-on costs, so as to at least "smell the smoke" of the real-life fire. A toy re-formulation of that argument is sketched right below.
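A minimal sketch of the add-on-costs-aware ( overhead-strict ) form of the Amdahl's Law argument used above; all the names and overhead figures are illustrative only:

```python
def speedup( p, N, setup_overhead = 0.0, transfer_overhead = 0.0 ):
    """Speedup of a unit of work having a p-fraction parallelisable over N
       resources, once the add-on ( instantiation + transfer + consolidation )
       costs, expressed as fractions of the original runtime, get paid."""
    return 1.0 / ( ( 1.0 - p )
                   + setup_overhead + transfer_overhead
                   + p / N )

print( speedup( p = 0.95, N = 1024 ) )                         # ~ 19.6 x, "free" parallelism
print( speedup( p = 0.95, N = 1024, setup_overhead = 0.10 ) )  # ~  6.6 x, with 10% add-on costs
```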
d) NUMA-ecosystems ( Non-Uniform Memory Access systems, the more so when provided via a proxy-(device-driver)-mediated access to a foreign-language code ( be it the CUDA c-language, OpenCL, an SoC/FPGA-based hybrid or any other, compiled into a static GPU-kernel code, transferred into the "inside" of the GPU-card and "let" execute there rather autonomously ) hosted on a remote, distributed piece of hardware, as GPU-s are an example of ) can hardly be expected to have all the properties native to the call-initiating sub-system evenly available and working, the less a both-way coordinated propagation of changes in state/value. Using global variables on the python side is also strongly discouraged ( except for a few well-known cases ), even inside a homogeneous, python-only software. A minimal sketch of the explicit state-transfers this implies follows right below.
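A minimal sketch, assuming numba and a CUDA-capable device are available, only to illustrate that there is no shared, both-way coordinated state between the python side and the GPU-device side -- every value has to be explicitly passed and every array explicitly copied to / from the device memory:

```python
import numpy as np
from numba import cuda

GLOBAL_GAIN = 2.0                            # a python-side "global" ...

@cuda.jit
def scale_kernel( device_array, gain ):
    i = cuda.grid( 1 )
    if i < device_array.size:
        device_array[i] *= gain              # ... has to travel as an argument

host_array   = np.arange( 8, dtype = np.float64 )
device_array = cuda.to_device( host_array )          # explicit H2D transfer
scale_kernel[1, 32]( device_array, GLOBAL_GAIN )      # value frozen at launch-time
host_array   = device_array.copy_to_host()            # explicit D2H transfer
print( host_array )
```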
e) python scripts ( the programs that the python interpreter interprets ), the less the threaded ones, do not run on a GPU. GPU-devices also have units of code-execution called threads, but these get executed on a principally different hardware ( vendor-specific, SM-based devices ) that exhibits SIMD-parallelism. For autonomously operated threads, where each one has a different code to execute, or even the same code but a different ( divergent, in GPU-jargon ) code-execution path, the SIMD-device will waste time in scheduling the SIMD-executable threads and letting all the incoherent SIMD-threads wait for their ( divergent ) turn, until their ( per-thread different ) instruction gets all-SIMD "free" to execute, i.e. until no other SIMD-thread in the block has a different SIMD-instruction pending. One can guess the immense performance deterioration, if divergent threads have to wait, on a ~ < 2 GHz SIMD-device, for some future time, when the 32-threads-(warp)-wide SIMD-device gets free from any other work and can first "privately" serve a single ( divergent ) SIMD-thread. Yes, there are python-tools that enable a python program to prepare another, "embedded", block-of-code, that will then get transformed, then get compiled, then downloaded via the GPU-device queues onto the fabric and scheduled for the remote execution on the GPU-device, yet all these transformations are complex, so only a high-level idea can be given here of how such a class of algorithms gets transformed onto a foreign computing device, using that foreign device processor's machine-code for the execution there, while also having to obey all the known foreign-device-specific constraints on resources and processing-organisation policies. A tiny divergence sketch follows right below.
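A minimal sketch, again assuming numba and a CUDA device, of a ( divergent ) branch: the odd and the even threads inside the same 32-threads-wide warp take different code-execution paths, so the SIMD hardware has to serialise the two paths, one after the other:

```python
import numpy as np
from numba import cuda

@cuda.jit
def divergent_kernel( out ):
    i = cuda.grid( 1 )
    if i < out.size:
        if i % 2 == 0:                       # half of the warp goes here ...
            out[i] = i * 2.0
        else:                                # ... the other half has to wait,
            out[i] = i * 3.0                 #     then gets served afterwards

out = cuda.to_device( np.zeros( 64, dtype = np.float64 ) )
divergent_kernel[2, 32]( out )
print( out.copy_to_host()[:8] )
```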
f) GPU-hosted kernel-code is indeed extremely sensitive to resources use, the more so if recursive. The depth-of-recursion of any kind of recursive algorithm ( the one here diverges, as noted in item a) ) may on its own make even a correctly formulated recursive code straight impossible to run inside a GPU-device, right due to the fixed and rather low amounts of high-performance on-chip memory resources ( the size of the SM cache is very limited even on the HPC-motivated GPU-fabrics, if compared to the sizes thereof on contemporary CPU-s, not speaking about the mainstream, primarily gaming-oriented, GPU-cards ). The usual first step out of this, a re-formulation of the recursion into an explicit-stack iteration, is sketched right below.
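A minimal sketch ( illustrative only ) of the usual work-around: re-express the recursion as an iteration over an explicit stack, so that the depth is paid in a plain data-structure instead of nested call-frames -- a necessary, though by far not sufficient, step before any attempt to move such work off-host:

```python
def summing_iterative( n ):
    """The same sum as summing_recursion_ok( n ), without any deep call-stack."""
    accumulator = 0
    stack = [ n ]
    while stack:
        k = stack.pop()
        if k > 0:
            accumulator += k
            stack.append( k - 1 )
    return accumulator

print( summing_iterative( 10**6 ) )          # no RecursionError raised
```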