Is there a way to release the GIL for pure functions using pure python?

Question

I think I must be missing something; this seems so right, but I can't see a way to do this.

Say you have a pure function in Python:

from math import sin, cos

def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
    return (x, y)

is there some built-in functionality or library that provides a wrapper of some sort that can release the GIL during the function's execution?

In my mind I am thinking of something along the lines of

from math import sin, cos
from somelib import pure

@pure
def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
    return (x, y)

Why do I think this might be useful?

Because multithreading, which is currently only attractive for I/O-bound programs, would become attractive for such functions once they become long-running. Doing something like

from math import sin, cos
from somelib import pure
from asyncio import run, gather, create_task

@pure  # releases GIL for f
async def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2 * t) - 2 * cos(3 * t) - cos(4 * t)
    return (x, y)


async def main():
    step_size = 0.1
    result = await gather(*[create_task(f(t / step_size))
                            for t in range(0, round(10 / step_size))])
    return result

if __name__ == "__main__":
    results = run(main())
    print(results)

Of course, multiprocessing offers Pool.map which can do something very similar. However, if the function returns a non-primitive / complex type then the worker has to serialize it and the main process HAS to deserialize and create a new object, creating a necessary copy. With threads, the child thread passes a pointer and the main thread simply takes ownership of the object. Much faster (and cleaner?).

To tie this to a practical problem I encountered a few weeks ago: I was doing a reinforcement learning project, which involved building an AI for a chess-like game. For this, I was simulating the AI playing against itself for > 100,000 games; each time returning the resulting sequence of board states (a numpy array). Generating these games runs in a loop, and I use this data to create a stronger version of the AI each time. Here, re-creating ("malloc") the sequence of states for each game in the main process was the bottleneck. I experimented with re-using existing objects, which is a bad idea for many reasons, but that didn't yield much improvement.

Edit: This question differs from How to run functions in parallel? , because I am not just looking for any way to run code in parallel (I know this can be achieved in various ways, e.g. via multiprocessing). I am looking for a way to let the interpreter know that nothing bad will happen when this function gets executed in a parallel thread.

Theoretically you could disable GIL by writing a C extension. But I can't really predict consequences of that. With that you are reaching an unsafe, undefined behaviour. For example who knows what exactly those sin, cos do and whether they will work correctly without GIL? The core issue here is that Python is slow and its threading support simply sucks. Perhaps you should switch to some other language? — freakish, Dec 04 '20 at 08:42
Python being python, you don't know what `t` is. If it's a single floating-point number, the cost of releasing and re-acquiring the GIL will be higher than the cost of 5 trig operations and a few basic math operations. Even in C++, where you don't have the GIL problem, it wouldn't make sense to create new `future`'s for such small tasks. A chess game between 2 AI's is another matter of course, but you want Tensorflow for that. The GIL doesn't apply to code running on the GPU. — MSalters, Dec 04 '20 at 08:48
Does this answer your question? [How to run functions in parallel?](https://stackoverflow.com/questions/7207309/how-to-run-functions-in-parallel) — mkrieger1, Dec 04 '20 at 08:49
@freakish I agree, just releasing the GIL without further thought is opening pandora's box. The first "safe" way that came to my mind was to create a new scope that doesn't have access to its parent in which the function gets executed and to then discard the scope except for the return variable. I didn't think this through though, esp. with respect to `import`s. I thought there might already be a library for this and I am not aware. — FirefoxMetzger, Dec 04 '20 at 09:01
@MSalters Yes, I'm already making use of Tensorflow to compute the moves. It doesn't solve all problems for a few reasons: (a) GPUs aren't good with branching, and a game like that branches A LOT; it really is CPU work. (b) predictions within a game are sequential (thanks to the high branching factor), (c) you can batch concurrently running games, but batching them in a multiprocess application is painful (current approach), (d) allowing each process a share of the GPU wastes precious GPU memory (copies of weights/parameters) — FirefoxMetzger, Dec 04 '20 at 09:10
@mkrieger1 Unfortunately, no. I'm already aware of `multiprocessing`. The "problem" (if you can call it that) is that this copies data under the hood leading to unwanted - and in my case intolerable - constant overhead. — FirefoxMetzger, Dec 04 '20 at 09:12
@FirefoxMetzger: Writing GPU code is a skill in itself. Branching often is a choice. For instance, evaluating chess trees can avoid branching to a degree by just evaluating all alternatives in parallel. Unlocking hidden parallelism is key to GPU speeds. — MSalters, Dec 04 '20 at 09:14
On a side note: even if it were possible to release the GIL, the asyncio code would still run sequentially because asyncio is for IO-based parallelism. (That asyncio is of no use is indicated by the `f` function not awaiting anything.) To get thread-based parallism in Python, one would turn to the `threading` module or (better) to `concurrent.futures.ThreadPoolExecutor`. — user4815162342, Dec 04 '20 at 14:05

user4815162342 · Accepted Answer · 2023-04-23T12:09:51.933

Is there a way to release the GIL for pure functions using pure python?

In short, the answer is no, because those functions aren't pure on the level on which the GIL operates.

GIL serves not just to protect objects from being updated concurrently by Python code, its primary purpose is to prevent the CPython interpreter from performing a data race (which is undefined behavior, i.e. forbidden in the C memory model, in which CPython executes) while accessing and updating global and shared data. This includes Python-visible singletons such as None, True, and False, but also all globals like modules, shared dicts, and caches. Then there is their metadata such as reference counts and type objects, as well as shared data used internally by the implementation.

Consider the provided pure function:

def f(t):
    x = 16 * sin(t) ** 3
    y = 13 * cos(t) - 5 * cos(2*t) - 2 * cos(3*t) - cos(4*t)
    return (x, y)

The dis tool reveals the operations that the interpreter performs when executing the function:

>>> dis.dis(f)
  2           0 LOAD_CONST               1 (16)
              2 LOAD_GLOBAL              0 (sin)
              4 LOAD_FAST                0 (t)
              6 CALL_FUNCTION            1
              8 LOAD_CONST               2 (3)
             10 BINARY_POWER
             12 BINARY_MULTIPLY
             14 STORE_FAST               1 (x)
             ...

To run the code, the interpreter must access the global symbols sin and cos in order to call them. It accesses the integers 2, 3, 4, 5, 13, and 16, which are all cached and therefore also global. In case of an error, it looks up the exception classes in order to instantiate the appropriate exceptions. Even when these global accesses don't modify the objects, they still involve writes because they must update the reference counts.

None of that can be done safely from multiple threads without synchronization. While it is conceivably possible to modify the Python interpreter to implement truly pure functions that don't access global state, it would require significant modifications to the internals, affecting compatibility with existing C extensions, including the vastly popular scientific ones. This last point is the principal reason why removing the GIL has proven to be so difficult.

Is there a way to release the GIL for pure functions using pure python?

1 Answers1

Linked