I'm trying to get Selenium-Wire to work in an AWS Lambda function. I've seen very few Stack Overflow posts about it, but it seems some people have been successful. My Lambda is stateless and doesn't need any other AWS feature (such as S3). It would scrape a page and capture the JSON response of one specific AJAX call on that page.
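To make the goal concrete, the capture step I have in mind looks roughly like the sketch below. It only relies on the `url`, `response.body` and `response.headers` attributes that Selenium-Wire exposes on captured requests. The helper name `find_json_response`, the `/api/items` path and the stand-in objects are made up for illustration, and handling only gzip is a simplifying assumption.

```python
import gzip
import json
from types import SimpleNamespace


def find_json_response(requests, url_fragment):
    """Return the parsed JSON body of the first captured request whose URL
    contains url_fragment, or None if nothing matches."""
    for request in requests:
        response = getattr(request, "response", None)
        # Skip requests that never got a response or that target other URLs.
        if response is None or url_fragment not in request.url:
            continue
        body = response.body
        # Bodies are captured as raw bytes; undo gzip compression if present.
        if response.headers.get("Content-Encoding", "") == "gzip":
            body = gzip.decompress(body)
        return json.loads(body.decode("utf-8"))
    return None


# Stand-in objects (hypothetical) so the helper can be tried without a browser.
fake_requests = [
    SimpleNamespace(url="https://example.com/page", response=None),
    SimpleNamespace(
        url="https://example.com/api/items",
        response=SimpleNamespace(
            body=gzip.compress(b'{"items": [1, 2, 3]}'),
            headers={"Content-Encoding": "gzip"},
        ),
    ),
]
print(find_json_response(fake_requests, "/api/items"))
```

In the real handler I would pass `driver.requests` instead of `fake_requests`, once the driver actually starts.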

Here is my Dockerfile:

FROM public.ecr.aws/lambda/python:3.9
# Should I go with python:3.8 instead?

# Install the function's dependencies using file requirements.txt
# from your project folder.

RUN yum makecache
# https://stackoverflow.com/questions/73056540/no-module-named-amazon-linux-extras-when-running-amazon-linux-extras-install-epe
RUN yum install -y amazon-linux-extras

# https://stackoverflow.com/questions/72077341/how-do-you-install-chrome-on-amazon-linux-2
RUN PYTHON=python2 amazon-linux-extras install epel -y

# https://stackoverflow.com/questions/72850004/no-package-zbar-available-in-lambda-layer
RUN yum makecache
RUN yum install -y chromium
ENV CHROMIUM_PATH=/usr/bin/chromium-browser
# or RUN yum install -y google-chrome-stable
# or https://intoli.com/blog/installing-google-chrome-on-centos/
# curl https://intoli.com/install-google-chrome.sh | bash
# https://devopsqa.wordpress.com/2018/03/08/install-google-chrome-and-chromedriver-in-amazon-linux-machine/

# https://www.usessionbuddy.com/post/How-To-Install-Selenium-Chrome-On-Centos-7/
RUN yum install -y chromedriver

RUN pip install --upgrade pip

COPY requirements.txt .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "app.handler" ]

My requirements.txt, pretty minimal:

selenium-wire==5.1.0

And my Lambda function:

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service

def handler(event, context):
  # https://gist.github.com/rengler33/f8b9d3f26a518c08a414f6f86109863c
  # https://github.com/wkeeling/selenium-wire/issues/131
  chrome_options = webdriver.ChromeOptions()

  chrome_option_list = {
    "disable-extensions",
    "disable-gpu",
    "no-sandbox",
    "headless", # for Jenkins
    "disable-dev-shm-usage", # Jenkins
    "window-size=800x600", # Jenkins
    "window-size=800,600",
    "disable-setuid-sandbox",
    "allow-insecure-localhost",
    "no-cache",
    "user-data-dir=/tmp/user-data",
    "hide-scrollbars",
    "enable-logging",
    "log-level=0",
    "single-process",
    "data-path=/tmp/data-path",
    "ignore-certificate-errors",
    "homedir=/tmp",
    "disk-cache-dir=/tmp/cache-dir",
    "start-maximized",
    "disable-software-rasterizer",
    "ignore-certificate-errors-spki-list",
    "ignore-ssl-errors",
  }

  for chrome_option in chrome_option_list:
    chrome_options.add_argument(f"--{chrome_option}")

  selenium_options = {
    "request_storage_base_dir": "/tmp", # Use /tmp to store captured data
    "exclude_hosts": ""
  }

  ser = Service("/usr/bin/chromedriver")
  ser.service_args=["--verbose", "--log-path=test.log"]

  driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)

  # The meat
  # ...

  return result

I built an image from the Dockerfile and uploaded it to AWS ECR. The image passes the classic "it works on my machine (TM)" test: it scrapes fine in a Docker container on my laptop. However, it returns an error when I run it as a Lambda (based on my own image):

START RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b Version: $LATEST
[ERROR] WebDriverException: Message: Service /usr/bin/chromedriver unexpectedly exited. Status code was: 1
Traceback (most recent call last):
  File "/var/task/app.py", line 43, in handler
    driver = webdriver.Chrome(service=ser, options=chrome_options, seleniumwire_options=selenium_options)
  File "/var/task/seleniumwire/webdriver.py", line 218, in __init__
    super().__init__(*args, **kwargs)
  File "/var/task/selenium/webdriver/chrome/webdriver.py", line 80, in __init__
    super().__init__(
  File "/var/task/selenium/webdriver/chromium/webdriver.py", line 101, in __init__
    self.service.start()
  File "/var/task/selenium/webdriver/common/service.py", line 104, in start
    self.assert_process_still_running()
  File "/var/task/selenium/webdriver/common/service.py", line 117, in assert_process_still_running
    raise WebDriverException(f"Service {self.path} unexpectedly exited. Status code was: {return_code}")
END RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b
REPORT RequestId: 3f767106-e6f5-4c5c-8930-e77b7314eb3b  Duration: 758.10 ms Billed Duration: 1361 ms    Memory Size: 128 MB Max Memory Used: 91 MB  Init Duration: 602.74 ms    

I also experimented with other Chrome switches, such as those mentioned in "selenium.common.exceptions.webdriverexception: message: 'chromedriver.exe' unexpectedly exited.status code was: 1", with no luck. I always get status code 1, but I couldn't find any documentation on what exactly it means. I assume it's some very blatant error.
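Since the exception only reports the exit code, one thing I could still try is launching the binaries directly from the handler and printing whatever they write to stdout/stderr, so the real error ends up in CloudWatch. This is only a diagnostic sketch, not a fix. The helper name `run_and_capture` is made up, and the paths in the comments are the ones from my Dockerfile and function above.

```python
import subprocess
import sys


def run_and_capture(cmd, timeout=10):
    """Run cmd and return (exit_code, combined stdout/stderr text).

    exit_code is None if the binary could not be started or timed out.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            timeout=timeout,
            check=False,  # a non-zero exit is exactly what we want to inspect
        )
    except FileNotFoundError as exc:
        return None, str(exc)
    except subprocess.TimeoutExpired as exc:
        output = exc.output or b""
        return None, "timed out: " + output.decode("utf-8", errors="replace")
    return proc.returncode, proc.stdout.decode("utf-8", errors="replace")


# Local smoke test using the current Python interpreter.
exit_code, output = run_and_capture([sys.executable, "--version"])
print(exit_code, output)

# Inside the Lambda handler I would call it like this instead:
# print(run_and_capture(["/usr/bin/chromedriver", "--version"]))
# print(run_and_capture(["/usr/bin/chromium-browser", "--version"]))
```

If either binary fails on a missing shared library or an unwritable path, the message should show up in the output. Lambda's filesystem is read-only except for /tmp, so the relative `--log-path=test.log` may itself be a problem, although I haven't verified that.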

Does anyone have a working image / Dockerfile + skeleton function I can try?

Csaba Toth