I have some Python code that uses ctypes.CDLL, which according to the docs releases the GIL while the foreign function runs. Even so, I am hitting bottlenecks that I can't explain from profiling. If I run a trivial test with time.sleep or even ctypes.windll.kernel32.Sleep, the work scales as expected when the number of threads matches the number of tasks: if each task sleeps for 1 second, then 1 task in 1 thread and 20 tasks in 20 threads both finish in roughly 1 second.
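This is roughly the kind of trivial test I mean (time.sleep shown here; swapping in ctypes.windll.kernel32.Sleep(1000) on Windows behaves the same for me):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task():
    # time.sleep releases the GIL, as does a ctypes call such as
    # ctypes.windll.kernel32.Sleep(1000) on Windows
    time.sleep(1)

for n in (1, 20):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        for _ in range(n):
            pool.submit(task)
    # the with-block waits for all tasks, so this prints total wall time
    print(f"{n} tasks in {n} threads: {time.perf_counter() - start:.2f}s")
```

Both runs report roughly 1 second, so the sleeps are genuinely running concurrently.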
Switching back to my code, it does not scale out like that: total runtime grows roughly linearly with the number of tasks. Profiling shows most of the time spent waiting in acquire() on _thread.lock objects.
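For reference, this is roughly how I am profiling (cProfile wrapped around the submission loop; mylib.dll and some_function are stand-ins for my actual library and ctypes calls):

```python
import cProfile
import ctypes
from concurrent.futures import ThreadPoolExecutor

# stand-in for my real library; the actual DLL name and function differ
mylib = ctypes.CDLL("mylib.dll")

def do_work(arg):
    return mylib.some_function(arg)

def run_all():
    with ThreadPoolExecutor(max_workers=20) as pool:
        # pool.map waits on the results, which is where the main thread blocks
        list(pool.map(do_work, range(20)))

cProfile.run("run_all()", sort="cumulative")
# the top entries are dominated by
# {method 'acquire' of '_thread.lock' objects}
```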
What are some techniques to dig further into this and see where the contention is coming from? Is ThreadPoolExecutor not the right choice here? My understanding was that it implements a basic thread pool and is essentially no different from multiprocessing.pool.ThreadPool.