I am writing a web-scraping program that goes through each URL in a list, opens the page at that URL, and extracts some information from the soup. Most of the time it works fine, but occasionally the program stops advancing through the list without terminating, raising warnings or exceptions, or otherwise showing any sign of error. My code, stripped down to the relevant parts, looks like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs
# some code...
for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urlopen(req)
    soup = bs(page, features="html.parser")
    # do some stuff with the soup...
When the program stalls, if I terminate it manually (using PyCharm), I get this traceback:
File "/Path/to/my/file.py", line 48, in <module>
soup = bs(page, features="html.parser")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 266, in __init__
markup = markup.read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 454, in read
return self._readall_chunked()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 564, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 610, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1052, in recv_into
return self.read(nbytes, buffer)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 911, in read
return self._sslobj.read(len, buffer)
KeyboardInterrupt
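The bottom of the stack shows the process blocked in a socket read that BeautifulSoup triggers via markup.read(), i.e., waiting indefinitely for data from the server. One mitigation I am considering (a minimal sketch, assuming a 30-second limit is acceptable; I have not verified that it fixes the stall) is to pass urlopen's timeout parameter, which applies to blocking socket operations, so a stalled read raises an exception instead of hanging forever:

import socket
from urllib.error import URLError
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        # timeout (in seconds) covers the connect and each blocking socket read,
        # including the read() that BeautifulSoup performs on the response
        page = urlopen(req, timeout=30)
        soup = bs(page, features="html.parser")
    except (socket.timeout, URLError) as err:
        # connect-phase timeouts surface wrapped in URLError; read-phase
        # timeouts raise socket.timeout directly
        print(f"skipping {url}: {err}")
        continue
    # do some stuff with the soup...

With this in place, a server that stops sending data mid-response would cause that URL to be skipped rather than freezing the whole run.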
Here's what I have tried and learned:
Added a check to make sure that the page status is 200 before making the soup. The failure condition never triggers.
Added a print statement after the soup is created. This print statement never fires when the program stalls.
The URLs are always valid. This is confirmed by the fact that the program does not stall on the same URL every time, and further confirmed by a similar program of mine with nearly identical code that shows the same behavior on a different set of URLs.
I have tried stepping through the loop with a debugger. The problem has not occurred in the 30 or so iterations I've checked manually, which may just be coincidence.
The page returns the correct headers when bs4 stalls. The problem seems to be isolated to the creation of the soup (see the diagnostic sketch below).
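As a next diagnostic step (a sketch of what I plan to try, not something I have run yet), I could separate the network read from the parse, since BeautifulSoup calls read() on the response object internally; if the hang moves to the read() line, that would confirm the parser itself is not at fault:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
page = urlopen(req)
raw = page.read()  # if the stall happens here, the network read is the culprit
soup = bs(raw, features="html.parser")  # parsing in-memory bytes cannot block on the socket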
What could cause this behavior?