
I'm having trouble with Selenium + ChromeDriver in an Ubuntu environment on AWS (an EC2 instance).

I'm using the ChromeDriver linux64 build (downloaded with wget https://chromedriver.storage.googleapis.com/78.0.3904.70/chromedriver_linux64.zip) and placed the chromedriver binary in /usr/bin.

Chrome was installed on Ubuntu using sudo dpkg -i google-chrome-stable_current_amd64.deb. If I check the version with google-chrome --version, I see:

Google Chrome 78.0.3904.70 
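
For reference, here's a quick sanity check I can run from Python to confirm the two versions line up (just a sketch; it assumes the binaries live at google-chrome and /usr/bin/chromedriver as described above):

import subprocess

# Print the installed Chrome and ChromeDriver versions so the major
# versions can be compared side by side.
chrome_version = subprocess.check_output(['google-chrome', '--version']).decode().strip()
driver_version = subprocess.check_output(['/usr/bin/chromedriver', '--version']).decode().strip()

print(chrome_version)   # e.g. Google Chrome 78.0.3904.70
print(driver_version)   # e.g. ChromeDriver 78.0.3904.70 (...)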

The following Python code works, but the issue is that it only works sporadically.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1420,1080')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--remote-debugging-port=9222")
options.add_argument('--disable-gpu')


driver = webdriver.Chrome(chrome_options=options)

#Set base url (SAN FRANCISCO)
base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='

#Build events array 
events = []
eventContainerBucket = []

for i in range(1,2):

    #cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)
    print(pageURL)

While the code above has worked without issue in the past, if I run it a few times, I end up getting the following error:

Traceback (most recent call last):
  File "BandsInTown_Scraper_SF.py", line 84, in <module>
    driver = webdriver.Chrome(chrome_options=options)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created
from disconnected: unable to connect to renderer
  (Session info: headless chrome=78.0.3904.70)

I've read that to solve this issue you may need to edit the /etc/hosts file. I've looked there, and everything looks fine:

##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1   localhost
255.255.255.255 broadcasthost
::1             localhost 

I'm also able to use requests and access URLs from the server just fine. For example, the following code gives me no issues whatsoever:

import requests
from bs4 import BeautifulSoup

url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page=6'
res = requests.get(url)
html_page = res.content

soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)
print(text)

Another important piece of information, which I believe could be causing this issue, is that ChromeDriver may not be allowed to run in headless mode. For example, if I run chromedriver in the terminal, I get this message:

Starting ChromeDriver 78.0.3904.70 (edb9c9f3de0247fd912a77b7f6cae7447f6d3ad5-refs/branch-heads/3904@{#800}) on port 9515
Only local connections are allowed.
Please protect ports used by ChromeDriver and related test frameworks to prevent access by malicious code.

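As far as I understand, that message just means ChromeDriver only accepts connections from 127.0.0.1, which is where Selenium connects from anyway when it runs on the same machine, so it shouldn't by itself prevent headless use. Something like the following should be able to attach to a chromedriver started manually on its default port 9515 (a rough sketch of a debugging step, not what my script normally does):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# With chromedriver already running in another terminal (default port 9515),
# attach to it over the loopback interface instead of letting Selenium
# spawn its own chromedriver process.
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:9515',
    options=options,
)
driver.get('https://www.bandsintown.com/en/c/san-francisco-ca?page=1')
print(driver.title)
driver.quit()
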
Finally, if I try to run chmod 777 in /usr/bin, I get "operation not permitted". Could this be part of the issue?

So, it appears Chrome and ChromeDriver are the same version, so that's not the issue. ChromeDriver and Selenium appear to be getting blocked somewhere, and I'm not sure how to solve this.

  • I would check whether you can connect from the server to the URL using `urllib.request` or `requests`. Some servers treat web scraping as illegal (as stealing data), so they may block that kind of access. – furas Nov 02 '19 at 17:47
  • @furas the odd thing is that this code has worked and connected to the webpage; it only fails after running it a few times. I just tried response = requests.get('https://api.github.com'); print(response) and it comes back fine, so the issue doesn't seem to be with connecting to the server, right? – DiamondJoe12 Nov 02 '19 at 19:53
  • It may mean there is no problem connecting to servers. Failing only after a few tries could mean that github.com blocks requests when they come too often or hit some request limit, or that the server blocks incorrect requests to its API. So the problem with Chrome may be something different, but I have no idea what it is. Maybe AWS blocks Chrome for some reason. – furas Nov 03 '19 at 01:52
  • Furas, thanks. Actually, I may close this question. It appears the sole issue was this line, which I removed: options.add_argument("--remote-debugging-port=9222") – DiamondJoe12 Nov 03 '19 at 04:28
  • Don't close it. Add your comment as an answer; it can be useful for other users. – furas Nov 03 '19 at 04:29

1 Answer


This is solved; the problem seems to have been on my end. Deleting this line: options.add_argument("--remote-debugging-port=9222") fixed the issue. Thanks.
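
For anyone hitting the same thing, the working setup is simply the option set from the question without that flag (a minimal sketch):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1420,1080')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
# Note: no --remote-debugging-port argument here.

driver = webdriver.Chrome(chrome_options=options)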
