
I am scraping Google search pages using Python/Selenium, and since last night I have been encountering a MaxRetryError: [Errno 61] Connection refused error. I debugged my code and found that the error originates in this code block:

domain = pattern.search(website)
counter = 2

# keep running this until the URL looks normal
while domain is None:
    counter += 1
    # close chrome and try again
    print('link not found, closing chrome and restarting ...\nwaiting {} seconds...'.format(counter))
    chrome.quit()
    time.sleep(counter)
    # chrome = webdriver.Chrome()
    time.sleep(10)                              ### tried inserting a time.sleep to delay the request
    chrome.get('https://google.com')            ### error is right here. This is the second instance of chrome.get in this script
    target = chrome.find_element_by_name('q')
    target.send_keys(college)
    target.send_keys(Keys.RETURN)

    # parse the webpage
    soup = BeautifulSoup(chrome.page_source, 'html.parser')

    website = soup.find('cite', attrs={'class': 'iUh30'}).text
    print('tried to get URL, is this it? : {}\n'.format(website))
    pattern = re.compile(r'\w+\.(edu|com)')
    domain = pattern.search(website)

I keep getting the following error:

raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='ADDRESS', port=PORT): Max retries exceeded with url: /session/92ca3da95353ca5972fb5c520b704be4/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11100e4e0>: Failed to establish a new connection: [Errno 61] Connection refused',))

As you can see in the code block above, I added a time.sleep() call, but it doesn't appear to help at all. For context, this script is part of a function that another script calls repeatedly in a loop, but again, I make sure to add delays between each call of the webdriver.get() method. As of now, my script fails at the first iteration of this loop.

I tried googling the issue, and the closest thing I found was this. It appears to describe the same exact error, and the top answer identifies the same method as the cause, but I don't really understand what its Solution and Conclusion sections are saying. I get that the MaxRetryError is confusing to debug, but what precisely is the solution?

It mentions a max_retries argument and tracebacks, but I don't know what they mean in this context. Is there any way I can catch this error (in the context of Selenium)? I have seen some threads on Stack Exchange that mention catching an error, but only in the context of urllib3. In my case, I would need to catch the same error for the Selenium package.

Thanks for any advice.

im2wddrf

2 Answers


My code still runs into issues every once in a while (those could probably be solved by using proxies), but I think I found the source of this error. The loop anticipates that the first pattern match will return a .edu or .com domain, but does not anticipate a .org. As a result, my code loops indefinitely whenever the first search result returns a .org. Here is the offending code:

website = soup.find('cite', attrs={'class': 'iUh30'}).text
print('tried to get URL, is this it? : {}\n'.format(website))
pattern = re.compile(r'\w+\.(edu|com)') # does not anticipate .org's 

Now my code runs okay, though I still run into errors when the script runs for too long (in that case, the source of the issue is much clearer).
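
For anyone who hits the same thing, here is a minimal sketch of the fix, with the pattern broadened to also accept .org (the example URL below is hypothetical, just to show the match):

import re

# broadened pattern: the original only matched .edu and .com, so a .org
# result left `domain` as None and the while loop never exited
pattern = re.compile(r'\w+\.(edu|com|org)')

domain = pattern.search('https://www.example.org/admissions')  # hypothetical URL
print(domain.group() if domain else 'no match')                # prints: example.org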

im2wddrf

You are quitting the Chrome driver too early. chrome.quit() shuts down both the browser and the chromedriver process, so the subsequent chrome.get('https://google.com') has nothing to connect to; urllib3 keeps retrying the dead connection and eventually raises the MaxRetryError you are seeing.

Try removing the call to chrome.quit().
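
If you do want a fresh browser on each retry, the alternative is to re-create the driver after quitting, which is what the commented-out chrome = webdriver.Chrome() line in the question's loop would do if restored. A minimal sketch of that approach (the restart_chrome helper is my own naming, not from the question):

from selenium import webdriver
import time

def restart_chrome(old_driver, delay):
    # quit() kills both the browser and the chromedriver process,
    # so a new driver must be created before calling get() again
    old_driver.quit()
    time.sleep(delay)
    return webdriver.Chrome()

# usage inside the question's retry loop:
# chrome = restart_chrome(chrome, counter)
# chrome.get('https://google.com')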

mbonness