2

I am running a script that scrapes several hundred pages on a site, but recently I have been running into IncompleteRead() errors. My understanding from looking on Stack Overflow is that they can happen for any number of unknown reasons.

From searching around, I believe the error is raised at random by the Request() call:

    from urllib.request import Request, urlopen
    from bs4 import BeautifulSoup

    # unq is a list of EC numbers, e.g. ['3.5.2.3', '2.1.3.15', ...]
    for ec in unq:
        print(ec)
        url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                      headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')


    3.5.2.3
    2.1.3.15
    2.5.1.72
    1.5.1.2
    6.1.1.9
    3.2.2.27
    Traceback (most recent call last):
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 554, in _get_chunk_left
        chunk_left = self._read_next_chunk_size()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 521, in _read_next_chunk_size
        return int(line, 16)
    
    ValueError: invalid literal for int() with base 16: b''
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 571, in _readall_chunked
        chunk_left = self._get_chunk_left()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 556, in _get_chunk_left
        raise IncompleteRead(b'')
    
    IncompleteRead: IncompleteRead(0 bytes read)
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "<ipython-input-20-82f1876d3006>", line 5, in <module>
        html = urlopen(url).read()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 464, in read
        return self._readall_chunked()
    
      File "C:\Users\wmn262\Anaconda3\lib\http\client.py", line 578, in _readall_chunked
        raise IncompleteRead(b''.join(value))
    
    IncompleteRead: IncompleteRead(1772944 bytes read)

The error happens at random, in that it is not always the same URL that causes it; https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27 caused this specific one.

Some solutions seem to introduce a try clause, but within the except they store the partial data (I think). Why is that the case? Why not just resubmit the request?

If so, how would I re-run the request, since doing that normally seems to solve the issue? Beyond this I have no idea how to fix the problem.

Lamma
  • Your question, as it is now, is basically unanswerable. Take a look at [ask] and show us your [mre]. – baduker Jan 11 '22 at 16:29
  • And don't spam with tags, there's nothing in your question that has anything to do with `beautifulSoup`. – baduker Jan 11 '22 at 16:30
  • @baduker Apologies, I completely spaced on adding the code as I was leaving in a hurry! I have added an MRE now and can provide a longer list of `unq` if that would help. It is ~1600 long so don't want to paste into a question here. – Lamma Jan 12 '22 at 08:58

2 Answers

2

The stack trace suggests that you are reading a chunked transfer-encoded response and that, for some reason, you lost the connection between two chunks.

As you have said, this can happen for numerous reasons, and the occurrence is random. So:

  • you cannot predict when or for what file it will happen
  • you cannot prevent it from happening

The best you can do is to catch the error and retry, after an optional delay.

For example:

    import http.client

    for ec in unq:
        print(ec)
        url = Request("https://www.brenda-enzymes.org/enzyme.php?ecno=" + ec,
                      headers={'User-Agent': 'Mozilla/5.0'})
        for i in range(4):
            try:
                html = urlopen(url).read()
                break                       # success, stop retrying
            except http.client.IncompleteRead:
                if i == 3:
                    raise                   # give up after 4 attempts
                # optionally add a delay here
        soup = BeautifulSoup(html, 'html.parser')
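
The comments below discuss making that delay an increasing one. A minimal sketch of the idea, assuming exponential backoff with `time.sleep` is acceptable (the `read_with_retries` helper, the four attempts and the 1-second base delay are illustrative choices, not part of the original code):

    import time
    from http.client import IncompleteRead
    from urllib.request import Request, urlopen

    def read_with_retries(url, attempts=4, base_delay=1.0):
        # Retry on IncompleteRead, sleeping 1 s, 2 s, 4 s, ... between attempts
        for i in range(attempts):
            try:
                return urlopen(url).read()
            except IncompleteRead:
                if i == attempts - 1:
                    raise                   # out of attempts, propagate the error
                time.sleep(base_delay * 2 ** i)

    html = read_with_retries(Request(
        "https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27",
        headers={'User-Agent': 'Mozilla/5.0'}))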
Serge Ballesta
  • Excellent, thank you for this. So the `try` idea was on the right track. What would adding a delay here do, other than obviously add a delay? Also, where do you get the exact error name `http.client.IncompleteRead` from? – Lamma Jan 12 '22 at 09:38
  • @Lamma: Transmission errors can happen for various reasons, including transient problems. A delay, and if possible an increasing one, allows some more time for the connection to become possible again. For `IncompleteRead` I had already found one there, so I just assumed it was the same as yours. – Serge Ballesta Jan 12 '22 at 09:41
  • I have added your suggestions to the question, including an increasing delay. That should work as intended, right? – Lamma Jan 12 '22 at 09:51
  • Regarding the error name, I get `NameError: name 'http' is not defined`. Is there a simple way to find out the full error name I am experiencing, as the trace does not really seem to reveal it? – Lamma Jan 12 '22 at 10:04
  • @Lamma: you need to import the module into the current one with `import http.client`. Alternatively, if you want to use `except IncompleteRead:`, you can import only that name with `from http.client import IncompleteRead`. – Serge Ballesta Jan 12 '22 at 10:13
  • Just read that somewhere else too and added it :D Thanks! I am running now and all seems to be well. – Lamma Jan 12 '22 at 10:15
0

I have faced the same issue and found this solution.

After some small changes, the code looks like this:

    import json
    from http.client import IncompleteRead, HTTPResponse
    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError
    ...


    def patch_http_response_read(func):
        def inner(*args):
            try:
                return func(*args)
            except IncompleteRead as e:
                return e.partial        # return the bytes received so far
        return inner

    # Monkey-patch HTTPResponse.read so truncated reads return partial data
    HTTPResponse.read = patch_http_response_read(HTTPResponse.read)

    try:
        response = urlopen(my_url)
        result = json.loads(response.read().decode('UTF-8'))
    except HTTPError as e:    # HTTPError subclasses URLError, so catch it first
        print('HTTP Error code: ', e.code)
    except URLError as e:
        print('URL Error Reason: ', e.reason)

I'm not sure that this is the best way, but it works in my case. I'll be happy if this advice is useful to you or helps you find another good solution. Happy coding!
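
To make the trade-off concrete: once the patch is applied, a dropped connection no longer raises, so for HTML scraping you may silently get a cut-off page. A short sketch of the effect, reusing one of the URLs from the question:

    # With HTTPResponse.read patched, IncompleteRead is swallowed:
    # read() returns only the bytes received before the connection dropped.
    response = urlopen('https://www.brenda-enzymes.org/enzyme.php?ecno=3.2.2.27')
    html = response.read()          # possibly a truncated document
    print(len(html), 'bytes received')

If you need the full content, retrying the request as in the other answer is usually the safer choice.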