
I have defined the function get_content to crawl data from https://www.investopedia.com/. I tried get_content('https://www.investopedia.com/terms/1/0x-protocol.asp') and it worked. However, when run in the parallel setting below, the process seems to run forever on my Windows laptop, while I have checked that it runs fine on Google Colab and on Linux laptops.

Could you please explain why my function does not work in this parallel setting?

import requests
from bs4 import BeautifulSoup
from multiprocessing import dummy, freeze_support, Pool
import os
core = os.cpu_count() # Number of logical processors for parallel computing
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
session = requests.Session() 
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp', 'https://www.investopedia.com/terms/1/1-10net30.asp']

############ Get content of a word
def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    print(entry_name)

############ Parallel computing 
if __name__== "__main__":
    freeze_support()
    P_d = dummy.Pool(processes = core)
    P = Pool(processes = core)   
    #content_list = P_d.map(get_content, links)
    content_list = P.map(get_content, links)

Update 1: I run this code in JupyterLab from the Anaconda distribution. As you can see from the screenshot below, the kernel status stays busy the whole time.

[screenshot: JupyterLab kernel status stuck on "busy"]

Update 2: The code finishes executing in Spyder, but it still produces no output.

[screenshot: Spyder run completes with no output]

Update 3: The code runs perfectly fine in Colab:

[screenshot: successful run in Colab]

    Works fine for me after moving all the shared data into `if __name__ == "__main__":` (I ran it on Windows). – ggorlen Mar 02 '21 at 19:26
    By the way, MP is probably overkill for this. Since you're just firing requests and most of the work is IO-bound, multithreading seems like a better fit because the thread can just babysit requests. MP is better for CPU-bound work. See [this answer](https://stackoverflow.com/a/65996653/6243352) – ggorlen Mar 02 '21 at 19:32
  • @ggorlen Can you elaborate on what "moving all the shared data ..." means? – Akira Mar 02 '21 at 19:34
  • See [this answer](https://stackoverflow.com/questions/24374288/where-to-put-freeze-support-in-a-python-script)--I assume this isn't an issue if you're on linux/mac and that it should run out of the box. I did pass the `session` object into each worker function, though, but I'm not sure that matters. – ggorlen Mar 02 '21 at 19:35
  • @ggorlen I tried to put the parallel code inside `if __name__ == "__main__":` but it did not work. [Here](https://colab.research.google.com/drive/1Oc2euId_QYHgr7TU3QzFznTPYc5UOwok?usp=sharing) is my modification. Yess, I run on Windows 10. Could you please post your code as an answer? – Akira Mar 02 '21 at 20:03
  • @ggorlen I've updated the code but to no avail. Do you have any idea about this error? I'm sorry for bothering you. – Akira Mar 03 '21 at 10:50
  • You're calling `main()` two times, the first call above `if __name__=="__main__":` leads to infinite recursion ([see](https://stackoverflow.com/q/52693216/9059420)). – Darkonaut Mar 03 '21 at 11:34
  • @Darkonaut I have modified it as [here](https://colab.research.google.com/drive/1Oc2euId_QYHgr7TU3QzFznTPYc5UOwok?usp=sharing) but it did not work. Could you please have a check on my code? – Akira Mar 03 '21 at 11:43
  • How did it "not work"? What's wrong with the output there? – Darkonaut Mar 03 '21 at 11:56
  • @Darkonaut The process just can not finish. It's always "busy". Please see my screenshot at https://i.stack.imgur.com/11x4K.png. – Akira Mar 03 '21 at 11:58
  • I can't tell from the screenshot. If this is still windows, the session you start below `if __name__ == "__main__"` isn't available to child processes. Count the number of printed outputs to figure out if it hangs with results outstanding or after all expected results are in. As @ggorlen already suggested, multiprocessing seems overkill here and you might get rid of problems with just switching to a thread-pool (`multiprocessing.dummy.Pool`) instead. – Darkonaut Mar 03 '21 at 12:22
  • Thank you @Darkonaut. It works perfectly with `dummy.Pool`. With `Pool`, it doesn't even produce any result from the child processes. It's so weird. – Akira Mar 03 '21 at 12:55
  • @LEAnhDung how are you running your program? Some IDEs / IPython don't capture `print` statements from child processes. Also, if you don't intend to `return` anything from your function, `P.apply` or `P.apply_async` may be more appropriate. – Aaron Mar 03 '21 at 14:33
  • @Aaron Please see the screenshot in my update. I actually want to fix the error with `multiprocessing.Pool` so that I can apply it later. – Akira Mar 03 '21 at 14:41
  • this is too short for an answer: For your own sanity, don't use Jupyter and multiprocessing. For some technical reasons, multiprocessing does not like interactive sessions, and Jupyter is strictly interactive mode. There are libraries that wrap multiprocessing to make it easier to use in interactive mode, but I usually just switch to a normal text editor / command prompt style IDE (I happen to use spyder which you already have if you use anaconda) – Aaron Mar 03 '21 at 14:43
  • @Aaron Please see my update in Spyder and Colab :)) – Akira Mar 03 '21 at 14:54
  • @LEAnhDung spyder uses IPython for the default shell which will not collect print statements from child processes. Change the "run" dialog to use an external system shell. Google Colab uses their own servers to actually execute the code, and are almost guaranteed to be Linux, so "fork" is the default "startmethod" which side-steps some of the "multiprocessing doesn't like interactive" problems. – Aaron Mar 03 '21 at 15:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229454/discussion-between-aaron-and-le-anh-dung). – Aaron Mar 03 '21 at 15:11

1 Answer


Quite a bit to unpack here, but it all basically boils down to how Python spins up a new process and executes the function you want.

On *nix systems, the default way to create a new process is by using fork. This is great because it uses "copy-on-write" to give the new child process access to a copy of the parent's working memory. It is fast and efficient, but it comes with a significant drawback if you're also using multithreading: not everything actually gets copied, and some things can end up copied in an invalid state (threads, mutexes, file handles, etc.). This can cause quite a number of problems if not handled correctly, and to get around those Python can use spawn instead (also, Windows doesn't have "fork" and must use "spawn").
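For reference, the active start method can be inspected, and a specific one requested explicitly; a minimal sketch of my own, not taken from the question:

import multiprocessing as mp

if __name__ == "__main__":
    print(mp.get_start_method())            # "fork" on Linux, "spawn" on Windows (and macOS since 3.8)
    ctx = mp.get_context("spawn")           # request "spawn" explicitly, regardless of the OS default
    with ctx.Pool(processes=2) as pool:
        print(pool.map(abs, [-3, -2, -1]))  # [3, 2, 1]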

Spawn basically starts a new interpreter from scratch and does not copy the parent's memory in any way. Some mechanism must be used to give the child access to the functions and data defined before it was created, however, and Python does this by having the new process essentially import * from the ".py" file it was created from. This is problematic in interactive mode because there isn't really a ".py" file to import, and it is the primary source of "multiprocessing doesn't like interactive" problems. Putting your mp code into a library which you then import and execute does work interactively, because the library can be imported from a ".py" file. This is also why we use the if __name__ == "__main__": guard to separate any code you don't want re-executed in the child when that import occurs. If you were to spawn a new process without it, it could recursively keep spawning children (though there's technically a built-in guard for that specific case, iirc).
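To make that concrete for this question, here is a hedged sketch of how the script could be laid out so it survives "spawn" on Windows: the worker function lives at module level so the child can import it, while everything that should run only once stays under the guard (I'm reusing the '#article-heading_3-0' selector from the question without re-verifying it):

import os
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool, freeze_support

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

def get_content(url):
    # Build the Session inside the worker; module-level objects get re-created
    # in every child on import anyway, so nothing is gained by "sharing" one.
    with requests.Session() as session:
        r = session.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Return a plain string: it pickles cleanly back to the parent process.
    return str(soup.select_one('#article-heading_3-0').contents[0])

if __name__ == "__main__":
    freeze_support()  # only matters for frozen executables; harmless otherwise
    links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
             'https://www.investopedia.com/terms/1/1-10net30.asp']
    with Pool(processes=os.cpu_count()) as pool:
        print(pool.map(get_content, links))

Run it as a plain .py script (python script.py) rather than from a notebook cell, for the reasons above.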

Then, with either start method, the parent communicates with the child over a pipe (using pickle to exchange Python objects), telling it which function to call and what the arguments are. This is why the arguments must be picklable. Some things can't be pickled, which is another common source of errors in multiprocessing.
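A tiny illustration of my own of that constraint: module-level functions pickle by reference, while a lambda does not, which is why handing a lambda to Pool.map fails with a pickling error:

import pickle

def top_level(x):
    return x * 2

print(pickle.loads(pickle.dumps(top_level))(21))   # 42: module-level functions pickle by reference

try:
    pickle.dumps(lambda x: x * 2)                  # a lambda has no importable name
except (pickle.PicklingError, AttributeError) as exc:
    print("not picklable:", exc)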

Finally on another note, the IPython interpreter (the default Spyder shell) doesn't always collect stdout or stderr from child processes when using "spawn", meaning print statements won't be shown. The vanilla (python.exe) interpreter handles this better.
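One way around that, as a sketch of my own rather than something from the question: return data from the worker instead of printing it, and do the printing in the parent, where stdout is always visible:

from multiprocessing import Pool

def work(x):
    print("child:", x)   # may be swallowed when the parent is an IPython/Spyder shell
    return x * x         # the return value always travels back through the pool

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        results = pool.map(work, [1, 2, 3])
    print("parent sees:", results)   # [1, 4, 9], printed by the parent itself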

In your specific case:

  • Jupyter lab runs in interactive mode, so the child process is created but hits an error along the lines of "can't import get_content from __main__". The error never gets displayed properly because it didn't happen in the main process, and Jupyter doesn't handle stderr from the child correctly.
  • Spyder uses IPython by default, which does not relay the print statements from the child to the parent. You can switch to the "external system console" in the "run" dialog, but then you must also do something to keep the window open long enough to read the output (prevent the process from exiting).
  • Google Colab uses a Google server running Linux to execute your code rather than executing it locally on your Windows machine, so with "fork" as the start method, the particular issue of not having a ".py" file to import from never arises.
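Finally, as the commenters suggested, the work here is IO-bound, so a thread pool sidesteps the spawn/import and stdout issues entirely. A minimal sketch, again reusing the question's selector without re-verifying it, and calling requests.get directly for simplicity:

import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool  # threads, but with the Pool API

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp']

def get_content(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup.select_one('#article-heading_3-0').contents[0]

if __name__ == "__main__":
    # Threads share the interpreter: nothing is pickled, nothing is re-imported,
    # and print() output shows up normally, even in Jupyter.
    with ThreadPool(processes=8) as pool:
        print(pool.map(get_content, links))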