
I am running a script on an AWS EC2 instance (Ubuntu). It's a web scraper that uses Selenium/ChromeDriver and headless Chrome to scrape some webpages. I've had this script running previously with no problems, but today I'm getting an error. Here's the script:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1420,1080')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument("--disable-notifications")

options.binary_location='/usr/bin/chromium-browser'
driver = webdriver.Chrome(chrome_options=options)


#Set base url (SAN FRANCISCO)
base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='

events = []

for i in range(1,90):
    #cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)
    print(pageURL)

When I run this script on Ubuntu, I get this error:

  Traceback (most recent call last):
  File "BandsInTown_Scraper_SF.py", line 91, in <module>
    driver = webdriver.Chrome(chrome_options=options)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1295, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

I confirmed that I'm running the same version of Chromedriver/Chromium Browser:

ChromeDriver 79.0.3945.130 (e22de67c28798d98833a7137c0e22876237fc40a-refs/branch-heads/3945@{#1047})


Chromium 79.0.3945.130 Built on Ubuntu , running on Ubuntu 18.04

For what it's worth, I have this running on a Mac, and I have multiple web scraping scripts like this one running on the same EC2 instance (only 2 scripts so far, so not that many).

Update

I'm now getting these errors as well when trying to run this script on Ubuntu:

    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 141, in _new_conn
        (self.host, self.port), self.timeout, **extra_kw)
      File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 60, in create_connection
        for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
      File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno -3] Temporary failure in name resolution


     During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 852, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 284, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
        timeout=timeout
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
        _stacktrace=sys.exc_info()[2])
      File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.bandsintown.com', port=443): Max retries exceeded with url: /en/c/san-francisco-ca?page=6 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "BandsInTown_Scraper_SF.py", line 39, in <module>
        res = requests.get(url)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/api.py", line 75, in get
        return request('get', url, params=params, **kwargs)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/api.py", line 60, in request
        return session.request(method=method, url=url, **kwargs)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
        r = adapter.send(request, **kwargs)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.bandsintown.com', port=443): Max retries exceeded with url: /en/c/san-francisco-ca?page=6 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

Finally, here's my current monthly AWS usage, which doesn't show any memory quota being exceeded.

[Image: AWS monthly usage table]

halfer
DiamondJoe12

2 Answers


This error message...

    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

...implies that the operating system was unable to allocate memory to initiate/spawn a new session.

Additionally, this error message...

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.bandsintown.com', port=443): Max retries exceeded with url: /en/c/san-francisco-ca?page=6 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f90945757f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

...implies that your program successfully iterated through page 5 and hit this error on page 6.
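
`[Errno -3] Temporary failure in name resolution` is a DNS lookup failure, and such failures are often transient, so one option is to retry the request after a pause. The following is a minimal sketch, not part of the original script: `retry_call` and its parameters are hypothetical names chosen for illustration, and the delays are arbitrary. If the lookups fail because the instance is out of memory, retrying alone is unlikely to help.

```python
import time


def retry_call(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on one of the listed exceptions, wait and try again.

    The delay doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    If every attempt fails, the last exception is re-raised.
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Hypothetical usage with requests (assumes `import requests` and a `url` variable):
# res = retry_call(lambda: requests.get(url),
#                  exceptions=(requests.exceptions.ConnectionError,))
```

The `sleep` parameter exists only so the waiting can be replaced in tests.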


I don't see any issues in your code block as such. I have taken your code, made some minor adjustments and here is the execution result:

  • Code Block:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='
    for i in range(1,10):
        #cycle through pages in range
        driver.get(base_url + str(i))
        pageURL = base_url + str(i)
        print(pageURL)
    
  • Console Output:

    https://www.bandsintown.com/en/c/san-francisco-ca?page=1
    https://www.bandsintown.com/en/c/san-francisco-ca?page=2
    https://www.bandsintown.com/en/c/san-francisco-ca?page=3
    https://www.bandsintown.com/en/c/san-francisco-ca?page=4
    https://www.bandsintown.com/en/c/san-francisco-ca?page=5
    https://www.bandsintown.com/en/c/san-francisco-ca?page=6
    https://www.bandsintown.com/en/c/san-francisco-ca?page=7
    https://www.bandsintown.com/en/c/san-francisco-ca?page=8
    https://www.bandsintown.com/en/c/san-francisco-ca?page=9
    

Deep dive

This error is coming from subprocess.py:

self.pid = _posixsubprocess.fork_exec(
    args, executable_list,
    close_fds, tuple(sorted(map(int, fds_to_keep))),
    cwd, env_list,
    p2cread, p2cwrite, c2pread, c2pwrite,
    errread, errwrite,
    errpipe_read, errpipe_write,
    restore_signals, start_new_session, preexec_fn)

However, as per the discussion in "OSError: [Errno 12] Cannot allocate memory", this error is related to insufficient RAM / swap.


Swap Space

Swap space is space on the system's hard drive that has been designated as a place for the operating system to temporarily store data that it can no longer hold in RAM. This lets you increase the amount of data your programs can keep in working memory. The swap space on the hard drive is used primarily when there is no longer sufficient space in RAM to hold in-use application data. Information written to disk is significantly slower to access than information kept in RAM, but the operating system prefers to keep running application data in memory and uses swap space for older data. Deploying swap space as a fallback for when your system’s RAM is depleted is a safety measure against out-of-memory issues on systems with non-SSD storage available.


System Check

To check if the system already has some swap space available, you need to execute the following command:

$ sudo swapon --show

If you don’t get any output, that means your system does not have swap space available currently. You can also verify that there is no active swap using the free utility as follows:

$ free -h

If there is no active swap in the system you will see an output as:

Output
               total        used       free        shared      buff/cache  available
Mem:           488M         36M        104M        652K        348M        426M
Swap:            0B          0B          0B
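
The same check can be done from Python before the scraper starts the browser. This is a minimal sketch under stated assumptions: it reads the Linux-only file `/proc/meminfo`, whose values are reported in kiB, and the function names `parse_meminfo`, `read_meminfo` and `has_swap` are hypothetical helpers, not part of any library.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kiB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the kernel's memory report (Linux only)."""
    with open(path) as fh:
        return parse_meminfo(fh.read())


def has_swap(info):
    """Return True if the kernel reports a non-zero SwapTotal."""
    return info.get("SwapTotal", 0) > 0


# Hypothetical usage on the EC2 instance:
# info = read_meminfo()
# print("MemAvailable (kiB):", info.get("MemAvailable"))
# print("Swap configured:", has_swap(info))
```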

Creating Swap File

If there is no swap, you need to allocate space for it. Although swap is commonly placed on a separate partition devoted to the task, you can also create a swap file that resides on an existing partition. To create a 1 Gigabyte file you need to execute the following command:

$ sudo fallocate -l 1G /swapfile

You can verify that the correct amount of space was reserved by executing the following command:

$ ls -lh /swapfile

#Output
-rw-r--r-- 1 root root 1.0G Mar 08 10:30 /swapfile

This confirms the swap file has been created with the correct amount of space set aside.


Enabling the Swap Space

Once a file of the correct size is available, you need to turn it into swap space. First, lock down the permissions of the file so that only users with root privileges can read its contents. This prevents unintended users from being able to access the file, which would have significant security implications. So you need to follow the steps below:

  • Make the file accessible only to a specific user, e.g. root, by executing the following command:

    $ sudo chmod 600 /swapfile
    
  • Verify the permissions change by executing the following command:

    $ ls -lh /swapfile
    
    #Output
    -rw------- 1 root root 1.0G Apr 25 11:14 /swapfile
    

    This confirms only the root user has the read and write flags enabled.

  • Now you need to mark the file as swap space by executing the following command:

    $ sudo mkswap /swapfile
    
    #Sample Output
    Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
    no label, UUID=6e965805-2ab9-450f-aed6-577e74089dbf
    
  • Next you need to enable the swap file, allowing the system to start utilizing it, by executing the following command:

    $ sudo swapon /swapfile
    
  • You can verify that the swap is available by executing the following command:

    $ sudo swapon --show
    
    #Sample Output
    NAME      TYPE  SIZE USED PRIO
    /swapfile file 1024M   0B   -1
    
  • Finally check the output of the free utility again to validate the settings by executing the following command:

    $ free -h
    
    #Sample Output
              total        used        free      shared  buff/cache   available
    Mem:           488M         37M         96M        652K        354M        425M
    Swap:          1.0G          0B        1.0G
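
The steps above can be sketched as a small Python helper. This is an illustrative sketch, not part of the original answer: `swap_setup_commands` and `run_swap_setup` are hypothetical names, and the default path `/swapfile` and size `1G` are taken from the commands above. It only prints the commands unless `dry_run=False`, in which case it has to be run as root.

```python
import shlex
import subprocess


def swap_setup_commands(path="/swapfile", size="1G"):
    """Return the argv lists for the swap-file steps, in execution order."""
    return [
        ["fallocate", "-l", size, path],  # reserve the file
        ["chmod", "600", path],           # restrict access to the owner (root)
        ["mkswap", path],                 # mark the file as swap space
        ["swapon", path],                 # enable it
    ]


def run_swap_setup(path="/swapfile", size="1G", dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in swap_setup_commands(path, size):
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            # check=True stops at the first failing step
            subprocess.run(argv, check=True)


# Hypothetical usage: review the commands first, then run as root.
# run_swap_setup()                # prints the four commands
# run_swap_setup(dry_run=False)   # actually creates and enables the swap file
```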
    

Conclusion

Once the Swap Space has been set up successfully the underlying operating system will begin to use it as necessary.

undetected Selenium

Probably what has happened is that the Chromium browser was updated and now takes more memory (or perhaps leaks memory worse; you don't say how many URLs it gets through before dying).

As a workaround, launch a larger instance size. You don't say what instance size you are using, but if you have a t3.micro, try a t3.medium instead.

There is an easy-to-understand chart here: https://www.ec2instances.info/?region=eu-west-1

If you have launched an instance and want to resize it without rebuilding from scratch, use the console to stop it, change the instance type, and start it again.
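
Independently of the instance size, it may help to make sure each script always shuts its browser down, so that Chromium processes from earlier or failed runs do not keep holding memory. That leftover processes are involved here is an assumption; the question does not confirm it. The sketch below uses a hypothetical helper name, `managed_driver`, and relies only on the driver object having a `quit()` method, as Selenium drivers do.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver via factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs even if the scraping loop raises, so the browser is closed
        driver.quit()


# Hypothetical usage with the options object from the question:
# with managed_driver(lambda: webdriver.Chrome(options=options)) as driver:
#     for i in range(1, 90):
#         driver.get(base_url + str(i))
```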

Vorsprung
  • Thank you Vorsprung. Why would chromium browser be updated automatically? I haven't made any changes. – DiamondJoe12 Mar 07 '20 at 17:25
  • Also: I'm using the AWS free tier, which is a t2.micro with 1.0 GiB. I see t3.medium has 4.0 GiB, but would cost about $0.045 hourly at the Linux on-demand rate, which is about $35 per month if I leave my instance running 24 hours a day. Is there a way to anticipate what my costs might be? Thanks. – DiamondJoe12 Mar 07 '20 at 17:29
  • 1
    Some Linux installs have updates on automatic, so it's possible it updated without your consent... I can't tell for sure, of course! Whatever the reason for the system running out of memory, upping the instance size is an easy way to fix the problem. To see the costs, the calculator is useful: https://calculator.s3.amazonaws.com/index.html – Vorsprung Mar 07 '20 at 17:33
  • Thanks - I updated my question to include my current AWS usage. If memory was an issue - wouldn't that be reflected in the usage quotas table I've included above? – DiamondJoe12 Mar 07 '20 at 17:34
  • 1
    The memory size that is exceeded when the script is run is RAM on the ec2. This is fixed on a t2.micro at 1GB. The billing info simply includes so many hours of ec2 - which is at that fixed memory size. – Vorsprung Mar 07 '20 at 17:45
  • I see, thanks. And you say if I happen to change instance size on the AWS management console, nothing except the instance size will change for that ec2 instance? I really do not want to rebuild it.. – DiamondJoe12 Mar 07 '20 at 18:17
  • 1
    follow the guide on this page https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-resize.html under "To resize an Amazon EBS–backed instance" – Vorsprung Mar 07 '20 at 21:05