
I am downloading and unzipping many large files in parallel using Python's `threading` module, but as I understand it, the GIL limits how much of my CPU I can actually use.

When I learned about Linux in school, I remember a lab in which we spawned a lot of processes by running `foo.py &` on the command line. These processes used up all of our CPU power.

Currently I am working on Windows, and I wonder whether I can likewise use the `subprocess` module to spawn multiple Python processes, each with its own GIL. I would split my list of download links into, say, four roughly equal sub-lists and pass one sub-list to each of four subprocesses. Each subprocess would then use `threading` to further speed up my downloads, roughly as sketched below. I'd do the same for the unzipping, which takes even longer than the downloading.
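
Roughly, this is what I have in mind. It's only a sketch (untested), and `worker.py` is a hypothetical helper script that would use `threading` to download its share of the links:

```python
import subprocess
import sys

# Placeholder links; the real list is much longer.
links = ["https://example.com/file%d.zip" % i for i in range(100)]

# Split the list into four roughly equal sub-lists.
sublists = [links[i::4] for i in range(4)]

# Spawn one Python subprocess per sub-list; each gets its own
# interpreter and therefore its own GIL.
procs = [
    subprocess.Popen([sys.executable, "worker.py", *sublist])
    for sublist in sublists
]

# Treat each subprocess's exit as its "finished" flag.
for p in procs:
    p.wait()
```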

Am I conceptualizing subprocesses correctly, and could this approach plausibly work for my downloading and unzipping purposes?

I've searched around SO and other web resources, but I haven't found much addressing such a hacky combination of multiprocessing and multithreading. There was this question, which said that the main program doesn't communicate with subprocesses once the latter are spawned, but for my purposes, each subprocess would only need to send a "finished" flag back to the main program.

Thank you!

ODK
  • The GIL won't substantially impact downloading -- it's only CPU time spent in Python itself, not I/O time, that matters. So it _could_ help with the actual unzip, but even then you don't need to spawn off extra Python processes for the unzip; you could just use `subprocess.Popen(['unzip', yourfilename])` or similar to use your system-native unzip tool. And if you _do_ want to spawn separate Python processes, the `multiprocessing` module is the built-in way to do that in upstream Python; generally, I'd suggest using it if you can instead of jumping straight to rolling your own. – Charles Duffy Dec 07 '20 at 18:56
  • Anyhow, I'm not completely clear what specific, narrow question you're asking that the answers to the question you link haven't already covered. – Charles Duffy Dec 07 '20 at 19:00
  • (One concrete reason to use `multiprocessing` instead of `subprocess` is that you can create a fixed-size pool -- if you have 20 CPUs and 100 files, for example, you can tell `multiprocessing` to create exactly 20 subprocesses, and as they finish their work pass them new files to unzip until the entire set of 100 files has been processed by those 20 processes; whereas with a naive use of `subprocess` you'd need to create one process per file and pay startup costs repeatedly. There's a sketch of this pattern after these comments.) – Charles Duffy Dec 07 '20 at 19:02
  • (BTW, it's not just that the GIL won't impact downloading; it _also_ won't impact disk I/O -- one Python thread can be doing work while another one is waiting for a read or a write to finish -- and it _also_ won't impact use of any C libraries that release the GIL while they're working, which ones built for high-performance computing generally will; so Python threads are often more useful in practice than someone who hasn't yet benchmarked their real-world workload might expect. A threaded-download sketch also follows these comments.) – Charles Duffy Dec 07 '20 at 19:04
  • (Not that the GIL can't force a redesign or even a rewrite away from Python, but be sure it's biting you in practice, not just in theory, before you make big decisions based on it.) – Charles Duffy Dec 07 '20 at 19:06
  • @CharlesDuffy, thank you for your answer. It sounds like I might not be understanding the GIL correctly. From your first comment, it sounds like the GIL impacts how many downloads Python can launch at once, but once the files are downloading, they're limited by I/O speeds. Why doesn't Python launch all of the downloads at once? My logging indicates that I have around ten active downloads at a time. Can it be because the files are small, so by the time the tenth download starts, the first one finishes and a new one takes its place? – ODK Dec 07 '20 at 19:11
  • No, the GIL doesn't meaningfully impact how many downloads Python can launch at once. – Charles Duffy Dec 07 '20 at 19:15
  • ...as for specific timing you're seeing, I'd need to actually have a chance to observe (and instrument) a test case. For all I know from here, it could be that the remote server only allows 10 downloads at a time from a given client IP address, no matter what your client's software stack is like. – Charles Duffy Dec 07 '20 at 19:16
  • @CharlesDuffy, your explanations are really helpful. I did not realize that `multiprocessing` is so similar to `subprocess`. My misunderstanding came from my experience on Linux, where running `foo.py &` opened a new terminal window; since `multiprocessing` doesn't do that (as far as I can see), I thought it was working fundamentally differently. One other question: for my unzip I'm using `shutil.copyfileobj`. Is that a Python-specific function, which would hold the GIL? – ODK Dec 07 '20 at 19:16
  • `foo.py &` doesn't usually open a new terminal window, unless `foo.py` itself does something that opens a window (like invoking a new copy of xterm / gnome-terminal / etc). – Charles Duffy Dec 07 '20 at 19:17
  • Yes, `shutil.copyfileobj` is a Python-native function, but even then, it'll only be holding the GIL while it's _actually in Python code_, not when it runs a syscall to ask the operating system to do I/O on Python's behalf. And if it's, say, copying from the network to disk, _almost all_ of its time will be waiting on syscalls (requests for the operating system to do something) to complete. – Charles Duffy Dec 07 '20 at 19:18
  • ...so the place where `copyfileobj` can spend a lot of time with the GIL held is the case where one of your objects (either the source or destination one) is something that's _not_ just doing disk or network I/O, but is instead doing something like compressing or decompressing content (in native Python code instead of with a C library that releases the GIL while it's working, which is what one would expect a high-performance implementation to do). – Charles Duffy Dec 07 '20 at 19:20
  • Thank you, @CharlesDuffy. All of your answers were really helpful. – ODK Dec 07 '20 at 19:33
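
For reference, here is a minimal sketch of the fixed-size pool Charles Duffy describes above, using `multiprocessing.Pool` plus Python's `zipfile` module for the extraction itself; the archive names and the pool size are made up:

```python
import multiprocessing
import zipfile

def unzip(archive_path):
    # Runs in a worker process with its own interpreter and GIL;
    # extracts the archive into the current working directory.
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall()
    return archive_path

if __name__ == "__main__":
    # Hypothetical list of already-downloaded archives.
    archives = ["file%d.zip" % i for i in range(100)]

    # A fixed-size pool: 20 worker processes share the 100 archives,
    # each picking up a new one as it finishes, so process startup
    # costs are paid only 20 times rather than 100.
    with multiprocessing.Pool(processes=20) as pool:
        for done in pool.imap_unordered(unzip, archives):
            print("finished:", done)
```

The `if __name__ == "__main__":` guard matters on Windows, where `multiprocessing` starts workers by spawning fresh interpreters that re-import the main module.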
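
And a minimal sketch of the threads-and-I/O point: the GIL is released while a thread blocks on the network, so plain threads go a long way for the download side. The URLs and the worker count here are placeholders:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # The GIL is released while this thread waits on network I/O,
    # so many downloads can be in flight at once.
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
    return filename

# Hypothetical download links.
urls = ["https://example.com/file%d.zip" % i for i in range(100)]

with ThreadPoolExecutor(max_workers=10) as pool:
    for name in pool.map(download, urls):
        print("finished:", name)
```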

0 Answers