
I am making a webscraping program that goes through each URL in a list of URLs, opens the page at that URL, and extracts some information from the soup. Most of the time it works fine, but occasionally the program stops advancing through the list without terminating, raising warnings/exceptions, or otherwise showing any sign of an error. My code, stripped down to the relevant parts, looks like this:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

# some code...

for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urlopen(req)
    soup = bs(page, features="html.parser")

    # do some stuff with the soup...

When the program stalls, if I terminate it manually (using PyCharm), I get this traceback:

File "/Path/to/my/file.py", line 48, in <module>
    soup = bs(page, features="html.parser")
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 266, in __init__
    markup = markup.read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 454, in read
    return self._readall_chunked()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 564, in _readall_chunked
    value.append(self._safe_read(chunk_left))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 610, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt

Here's what I have tried and learned:

  • Added a check to make sure that the page status is always 200 before making the soup. The check never fails.

  • Added a print statement right after the soup is created. This print statement never triggers after a stall (a rough sketch of both checks is shown after this list).

  • The URLs are always valid. This is supported by the fact that the program does not stall on the same URL every time, and further confirmed by a similar program of mine with nearly identical code that shows the same behavior on a different set of URLs.

  • I have tried running through this step-by-step with a debugger. The problem has not occurred in the 30 or so iterations I've checked manually, which may just be coincidence.

  • The page returns the correct headers when bs4 stalls. The problem seems to be isolated to the creation of the soup.
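
For context, the status check and the confirmation print sit in the loop roughly like this. This is a minimal sketch, not my exact code; in particular, reading the status via `page.status` (or `page.getcode()`) is an assumption about how the 200 check is done:

for url in url_list:
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urlopen(req)

    # status check from the first bullet: this never fails
    if page.status != 200:
        continue

    # this is the call that never returns when the program stalls
    soup = bs(page, features="html.parser")

    # confirmation print from the second bullet: this never fires after a stall
    print("parsed", url)

    # do some stuff with the soup...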

What could cause this behavior?

Tumblewood
  • Please show us the **full** traceback! – Klaus D. Oct 12 '19 at 08:08
  • First use `print(page)` to see what you get from the server when you have the problem. – furas Oct 12 '19 at 08:35
  • @KlausD. I have taken your advice and updated my post accordingly. – Tumblewood Oct 12 '19 at 17:44
  • Had the same problem [omitting file calls, given size of comment]: `self.soup = bs(self.response, 'html.parser')` → `markup = markup.read()` → `return self._readall_chunked()` → `chunk_left = self._get_chunk_left()` → `chunk_left = self._read_next_chunk_size()` → `line = self.fp.readline(_MAXLINE + 1)` → `return self._sock.recv_into(b)` → `return self.read(nbytes, buffer)` → `return self._sslobj.read(len, buffer)` → `KeyboardInterrupt` – B Furtado Jun 05 '20 at 19:54
  • It sure looks like a Beautiful Soup problem. I got a 200 response.status, and yet, no parsing. I guess the silent line is: `self.soup = bs(self.response, 'html.parser')` – B Furtado Jun 05 '20 at 21:11
  • @Tumblewood have you found an answer? I also got a 200 status, but silent parsing... I cannot print all of the requests until I can get a wrong one...can I? – B Furtado Jun 06 '20 at 15:12
  • @BFurtado I have managed to move past this, but I'm not satisfied with the solution. My current approach is to set a timeout using a custom TimeoutError class that stops the parsing after a set number of seconds and retries (a rough sketch of this approach is shown after these comments). This only works because I'm on a Unix system (i.e. not Windows), and even then it's not satisfactory. I hope you find a better explanation of this phenomenon, and if you do, let me know! – Tumblewood Jun 06 '20 at 18:15
  • @Tumblewood No. I scrape every week. Last Friday it worked perfectly; today it did not! What sucks is that I get no mail (when there is a mistake) from my cron job (or in the terminal). The thing is, you cannot see every page (when it is working) until it is not!!! Awful. Still... – B Furtado Jun 19 '20 at 20:41
  • @Tumblewood Today it stalled again. How exactly do you "set a timeout using a custom TimeoutError class that will stop the parsing after a set number of seconds and retry"? That might work for me, because I can easily let go of the occasional page that does not work. I would then be able to examine said page to try to find what is wrong. Thanks. – B Furtado Jun 26 '20 at 17:29
  • @BFurtado I modeled it after [this answer](https://stackoverflow.com/a/53907894/8028530) to another question. – Tumblewood Jul 06 '20 at 00:06
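
Below is a rough sketch of the timeout-and-retry workaround described in Tumblewood's comment above, modeled on the linked answer. It relies on `signal.alarm`, which is why it is Unix-only, and the `ParseTimeoutError` name, the 30-second limit, and the retry count are illustrative placeholders rather than the exact values used:

import signal
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

class ParseTimeoutError(Exception):
    """Raised when fetching/parsing a single page takes too long."""

def _alarm_handler(signum, frame):
    raise ParseTimeoutError()

# SIGALRM and signal.alarm() are only available on Unix
signal.signal(signal.SIGALRM, _alarm_handler)

for url in url_list:
    for attempt in range(3):   # retry a few times before moving on
        signal.alarm(30)       # deliver SIGALRM after 30 seconds
        try:
            req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
            page = urlopen(req)
            soup = bs(page, features="html.parser")
        except ParseTimeoutError:
            print("timed out on", url, "- retrying")
            continue
        finally:
            signal.alarm(0)    # cancel the pending alarm
        # do some stuff with the soup...
        break                  # success: move on to the next URL

The downside, as noted in the comments, is that this only skips or retries the stalled page; it does not explain why the read hangs in the first place.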

0 Answers