8

I've looked at and tried nearly every other post on this topic with no luck.

EC2

I'm using Python 3.6, so I'm using the AMI amzn-ami-hvm-2018.03.0.20181129-x86_64-gp2, the same machine image that Lambda uses. Once I SSH into my EC2 instance, I install Chrome with:

sudo curl https://intoli.com/install-google-chrome.sh | bash
cp -r /opt/google/chrome/ /home/ec2-user/
google-chrome-stable --version
# Google Chrome 86.0.4240.198 

And download and unzip the matching Chromedriver:

sudo wget https://chromedriver.storage.googleapis.com/86.0.4240.22/chromedriver_linux64.zip
sudo unzip chromedriver_linux64.zip

I install python36 and selenium with:

sudo yum install python36 -y
sudo /usr/bin/pip-3.6 install selenium

Then run the script:

import os
import selenium
from selenium import webdriver

CURR_PATH = os.getcwd()
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1280x1696')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--remote-debugging-port=9222')
chrome_options.binary_location = f"{CURR_PATH}/chrome/google-chrome"
driver = webdriver.Chrome(
    executable_path = f"{CURR_PATH}/chromedriver",
    chrome_options=chrome_options
)
driver.get("https://www.google.com/")
html = driver.page_source
print(html)

This works on the EC2 instance.

Lambda

I then zip my chromedriver and Chrome files:

mkdir tmp
mv chromedriver tmp
mv chrome tmp
cd tmp
zip -r9 ../chrome.zip chromedriver chrome

And copy the zipped file to an S3 bucket

This is my lambda function:

import os
import boto3
from botocore.exceptions import ClientError
import zipfile
import selenium
from selenium import webdriver

s3 = boto3.resource('s3')

def handler(event, context):
    chrome_bucket = os.environ.get('CHROME_S3_BUCKET')
    chrome_key = os.environ.get('CHROME_S3_KEY')
    # DOWNLOAD HEADLESS CHROME FROM S3
    try:    
        # with open('/tmp/headless_chrome.zip', 'wb') as data:
        s3.meta.client.download_file(chrome_bucket, chrome_key, '/tmp/chrome.zip')
        print(os.listdir('/tmp'))
    except ClientError as e:
        raise e
    # UNZIP HEADLESS CHROME
    try:
        with zipfile.ZipFile('/tmp/chrome.zip', 'r') as zip_ref:
            zip_ref.extractall('/tmp')
        # FREE UP SPACE
        os.remove('/tmp/chrome.zip')
        print(os.listdir('/tmp'))
    except Exception as e:
        # Chain the original exception so the real cause stays in the traceback
        raise ValueError('Problem with unzipping Chrome executable') from e
    # CHANGE PERMISSION OF CHROME
    try:
        os.chmod('/tmp/chromedriver', 0o775)
        os.chmod('/tmp/chrome/chrome', 0o775)
        os.chmod('/tmp/chrome/google-chrome', 0o775)
    except Exception as e:
        # Chain the original exception so the real cause stays in the traceback
        raise ValueError('Problem with changing permissions to Chrome executable') from e
    # GET LINKS
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--window-size=1280x1696')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--hide-scrollbars')
    chrome_options.add_argument('--enable-logging')
    chrome_options.add_argument('--log-level=0')
    chrome_options.add_argument('--v=99')
    chrome_options.add_argument('--single-process')
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('--remote-debugging-port=9222')
    chrome_options.binary_location = "/tmp/chrome/google-chrome"
    driver = webdriver.Chrome(
        executable_path = "/tmp/chromedriver",
        chrome_options=chrome_options
    )
    driver.get("https://www.google.com/")
    html = driver.page_source
    print(html)

I'm able to see my unzipped files in the /tmp path.
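One thing that may be worth ruling out at this step: Python's `zipfile.extractall()` does not restore the Unix permission bits stored in the archive, and the handler above only re-marks three files as executable. The sketch below is a hypothetical helper (the name `extract_preserving_modes` is not from the original code) that reapplies the recorded mode to every extracted file. It is an assumption that other files in the Chrome directory need their executable bit; this has not been confirmed as the cause of the error.

```python
import os
import stat
import zipfile


def extract_preserving_modes(zip_path, dest_dir):
    """Extract zip_path into dest_dir and reapply stored Unix permission bits.

    Hypothetical helper: ZipFile.extractall() writes files with default
    permissions, so executable bits recorded in the archive are lost.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        for info in archive.infolist():
            extracted_path = archive.extract(info, dest_dir)
            # The upper 16 bits of external_attr hold the Unix mode when the
            # archive was built on a Unix-like system; 0 means "not recorded".
            mode = stat.S_IMODE(info.external_attr >> 16)
            if mode and not info.is_dir():
                os.chmod(extracted_path, mode)
```

In the handler, this would stand in for the `extractall('/tmp')` call, for example `extract_preserving_modes('/tmp/chrome.zip', '/tmp')`.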

And my error:

{
  "errorMessage": "Message: unknown error: unable to discover open pages\n",
  "errorType": "WebDriverException",
  "stackTrace": [
    [
      "/var/task/lib/observer.py",
      69,
      "handler",
      "chrome_options=chrome_options"
    ],
    [
      "/var/task/selenium/webdriver/chrome/webdriver.py",
      81,
      "__init__",
      "desired_capabilities=desired_capabilities)"
    ],
    [
      "/var/task/selenium/webdriver/remote/webdriver.py",
      157,
      "__init__",
      "self.start_session(capabilities, browser_profile)"
    ],
    [
      "/var/task/selenium/webdriver/remote/webdriver.py",
      252,
      "start_session",
      "response = self.execute(Command.NEW_SESSION, parameters)"
    ],
    [
      "/var/task/selenium/webdriver/remote/webdriver.py",
      321,
      "execute",
      "self.error_handler.check_response(response)"
    ],
    [
      "/var/task/selenium/webdriver/remote/errorhandler.py",
      242,
      "check_response",
      "raise exception_class(message, screen, stacktrace)"
    ]
  ]
}

EDIT: I am willing to try out anything at this point. Different versions of Chrome or Chromium, Chromedriver, Python or Selenium.

EDIT2: The answer below did not solve the problem.

CPak
  • The Chrome installer almost certainly does more than dump a bunch of files in a directory. – tripleee Nov 16 '20 at 11:22
  • Thanks for pointing this out. You're right that the installer does more than dump files, but I think the files are the only thing that matters. For instance, I can upload the built Google chrome/chromedriver files into a fresh EC2, and the python script works. – CPak Nov 16 '20 at 11:38
  • Chrome (and thus Selenium) needs a display driver to run; are you using something like Xvfb or how are you arranging this? – tripleee Nov 16 '20 at 11:44
  • It's running `--headless`? Even if you're correct and I'm not taking into account some kind of display driver, I'm using the same machine image that the Lambda uses. Are you suggesting that the Lambda is missing a display driver that the EC2 includes? – CPak Nov 16 '20 at 11:48
  • If you want to use puppeteer instead of Selenium https://www.npmjs.com/package/chrome-aws-lambda (Chromium Binary for AWS Lambda and Google Cloud Functions) – Rahul L Nov 19 '20 at 06:08
  • There is sample project available on git https://github.com/vittorio-nardone/selenium-chromium-lambda which is using serverless-chrome https://github.com/adieuadieu/serverless-chrome . May give some information – Rahul L Nov 19 '20 at 06:42
  • [This post](https://stackoverflow.com/a/21004947/10625611) suggests outdated `chromedriver`. – Qumber Nov 25 '20 at 06:39

4 Answers

5

This error message...

"errorMessage": "Message: unknown error: unable to discover open pages\n",
"errorType": "WebDriverException"

...implies that the ChromeDriver was unable to initiate/spawn a new Browsing Context i.e. Chrome Browser session.

It seems the issue is with ChromeDriver's sandboxing security feature.


Rule of thumb

A common cause for Chrome to crash during startup is running Chrome as root user (administrator) on Linux. While it is possible to work around this issue by passing --no-sandbox flag when creating your WebDriver session, such a configuration is unsupported and highly discouraged. You need to configure your environment to run Chrome as a regular user instead.


Details

A few more details about your use case would have helped us analyze the arguments you have used and the root cause of the error. However, a few thoughts:

  • What is the sandbox?: The sandbox is a C++ library that allows the creation of sandboxed processes: processes that execute within a very restrictive environment. The only resources sandboxed processes can freely use are CPU cycles and memory. For example, sandboxed processes cannot write to disk or display their own windows. What exactly they can do is controlled by an explicit policy. Chromium renderers are sandboxed processes.
  • What does and doesn't it protect against?: The sandbox limits the severity of bugs in code running inside the sandbox. Such bugs cannot install persistent malware in the user's account (because writing to the filesystem is banned). Such bugs also cannot read and steal arbitrary files from the user's machine. (In Chromium, the renderer processes are sandboxed and have this protection. After the NPAPI removal, all remaining plugins are also sandboxed. Also note that Chromium renderer processes are isolated from the system, but not yet from the web. Therefore, domain-based data isolation is not yet provided.) The sandbox cannot provide any protection against bugs in system components such as the kernel it is running on.
  • So how can a sandboxed process such as a renderer accomplish anything?: Certain communication channels are explicitly open for the sandboxed processes; the processes can write and read from these channels. A more privileged process can use these channels to do certain actions on behalf of the sandboxed process. In Chromium, the privileged process is usually the browser process.

So you may need to drop the --no-sandbox option. Here is the link to the Sandbox story.


Additional Considerations

Some more considerations:

  • While using --headless option you won't be able to use --window-size=1280x1696 due to certain constraints.

You can find a relevant detailed discussion in "ERROR:gpu_process_transport_factory.cc(1007)-Lost UI shared context : while initializing Chrome browser through ChromeDriver in Headless mode".

  • Further, you haven't mentioned any specific requirement for the --disable-dev-shm-usage, --hide-scrollbars, --enable-logging, --log-level=0, --v=99, --single-process and --remote-debugging-port=9222 arguments. You may opt to drop them for the time being and add them back as your test specification requires.
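The "start minimal, then add flags back" approach suggested above can be sketched as follows. This is only an illustration: the split into `BASELINE_ARGS` and `OPTIONAL_ARGS`, and the helper names `build_chrome_args` and `make_options`, are assumptions made for this example. Which flags are actually required on Lambda has to be found by testing.

```python
# Flags are taken from the question. Treating only --headless as the baseline
# is an assumption for this sketch, not a verified minimal set.
BASELINE_ARGS = ["--headless"]
OPTIONAL_ARGS = [
    "--no-sandbox",
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--single-process",
]


def build_chrome_args(enabled_optional=()):
    """Return the baseline flags plus the requested optional ones, in a stable order."""
    unknown = set(enabled_optional) - set(OPTIONAL_ARGS)
    if unknown:
        raise ValueError(f"Unknown optional flags: {sorted(unknown)}")
    return BASELINE_ARGS + [arg for arg in OPTIONAL_ARGS if arg in enabled_optional]


def make_options(args):
    """Build ChromeOptions from a list of flags (selenium is imported lazily)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in args:
        options.add_argument(arg)
    return options
```

A test run would begin with `make_options(build_chrome_args())` and then re-enable one optional flag at a time, so a failure can be attributed to a single flag.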


undetected Selenium
  • Thanks for the response. I am already using the `--no-sandbox` argument (1st and 2nd link). – CPak Nov 18 '20 at 21:40
  • If you could identify why `--no-sandbox` doesn't work for me, that would be a great help. Thanks! – CPak Nov 18 '20 at 21:49
  • @CPak Checkout the updated answer and let me know the result and your thoughts about them. – undetected Selenium Nov 18 '20 at 22:18
  • Thanks, I initially tried with only `--no-sandbox` and `--headless` and that did not work. I only added the many other options because other people (on StackO) had suggested it fixed their issue. – CPak Nov 19 '20 at 19:08
  • @CPak Can you retest once dropping `--no-sandbox` and the other unnecessary arguments? – undetected Selenium Nov 19 '20 at 19:57
  • Sure, could you clarify which arguments I should use for testing? `--headless` only? – CPak Nov 19 '20 at 20:18
  • @CPak To start within the _AWS Lambda_ environment in **headless** mode, use only the `--headless` argument. Once your testbed is well established, in the next steps you can add the other arguments as per the _Test Specifications_/requirements. – undetected Selenium Nov 19 '20 at 20:22
  • Your hypothesis is that it will work using only the `--headless` option? And that the other options are making it fail? – CPak Nov 19 '20 at 20:43
  • @CPak Let's discuss the issue in details within [Selenium Chat Room](https://chat.stackoverflow.com/rooms/223360/selenium) – undetected Selenium Nov 19 '20 at 20:48
2

I was finally able to get it to work with the following versions:

Python 3.7
selenium==3.14.0
headless-chromium v1.0.0-55
chromedriver 2.43

Headless-Chromium

https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-55/stable-headless-chromium-amazonlinux-2017-03.zip

Chromedriver

https://chromedriver.storage.googleapis.com/2.43/chromedriver_linux64.zip

I added headless-chromium and chromedriver to a Lambda Layer.

Permissions of 755 work for both.

Lambda

The Lambda function looks like this

import os
import selenium
from selenium import webdriver


def handler(event, context):
    print(os.listdir('/opt'))
    # 
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--single-process')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.binary_location = "/opt/headless-chromium"
    driver = webdriver.Chrome(
        executable_path="/opt/chromedriver",
        chrome_options=chrome_options
    )
    driver.get("https://www.google.com/")
    html = driver.page_source
    driver.close()
    driver.quit()
    print(html)

Hope this helps someone in Q4 2020 and after.

CPak
  • Thanks, it worked perfectly for me!. Two questions: 1) How do you know what version of chromedriver is compatible with a specific chromium version? 2) What configuration of memory do you have for this lambda? – David López Jan 10 '21 at 20:08
  • In general, you look at the version of Chromium that's included in the release. For instance, look at https://github.com/adieuadieu/serverless-chrome/releases to see that `v1.0.0-55` includes Chromium `69/70`. Then you look at the corresponding Chromedriver at https://chromedriver.chromium.org/downloads to see that `v2.43` supports Chrome `69-71`. You should explicitly test of course, but this gets you in the right neighborhood. – CPak Jan 11 '21 at 00:42
  • I specified 1 GB of RAM. It's not optimized (that is, I didn't test the minimum memory required), but it worked, so I moved on. – CPak Jan 11 '21 at 00:44
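The version-matching rule of thumb from the comments can be written down as a small check. This is a sketch under stated assumptions: `is_compatible` and `LEGACY_DRIVER_RANGES` are hypothetical names, the only legacy entry is the 2.43 range quoted in the comment above, and the "same major version" rule applies to newer ChromeDriver releases that share Chrome's version scheme. It narrows down candidates; it does not replace an actual test run.

```python
# Range quoted in the comment above: ChromeDriver 2.43 supports Chrome 69-71.
# Add further legacy (2.x) entries from the ChromeDriver downloads page as needed.
LEGACY_DRIVER_RANGES = {"2.43": (69, 71)}


def major(version):
    """Return the leading numeric component of a dotted version string."""
    return int(version.split(".")[0])


def is_compatible(chrome_version, driver_version):
    """Rough compatibility check between a Chrome build and a ChromeDriver build."""
    chrome_major = major(chrome_version)
    if major(driver_version) == 2:
        # Legacy 2.x drivers each support a range of Chrome major versions.
        key = ".".join(driver_version.split(".")[:2])
        if key not in LEGACY_DRIVER_RANGES:
            raise ValueError(f"No known Chrome range for ChromeDriver {driver_version}")
        low, high = LEGACY_DRIVER_RANGES[key]
        return low <= chrome_major <= high
    # Newer drivers use Chrome's own version scheme: compare major versions.
    return major(driver_version) == chrome_major
```

For the versions in this thread, `is_compatible("86.0.4240.198", "86.0.4240.22")` compares major versions directly, while a Chromium 69 build paired with driver `"2.43"` goes through the legacy table.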
1

The answer of @CPak worked for me, I only had to copy the headless-chromium and chromedriver to /tmp and grant permissions, the rest of the code is the same:

import os
from shutil import copyfile

def permissions(origin_path, destiny_path):
    # Copy the binary out of the read-only layer and make the copy executable
    copyfile(origin_path, destiny_path)
    os.chmod(destiny_path, 0o775)

    
def lambda_handler(event, context):
    # Stage both binaries in /tmp before creating the WebDriver
    permissions('/opt/chromedriver','/tmp/chromedriver')
    permissions('/opt/headless-chromium','/tmp/headless-chromium')
David López
1

I'm a big fan of this answer because a few months ago it allowed me to properly run a serverless scraper on AWS Lambda. But a few days ago this implementation began to fail, and after hours and hours of searching I came to the conclusion that the binaries given here by @CPak (for Chrome version 69) are too old to run on "modern" websites.

I found in this GitHub repo a file called chromium.zip, which is the headless-chromium binary for version 86.0.4240.0. And here I downloaded the matching chromedriver. With these two files replacing the ones in @CPak's answer or in my previous one, the implementation should work.

I'm still trying to find where to obtain more recent versions of the headless Chromium binaries for when these versions stop working. When I find them, I'll post here.

David López