I'm doing web scraping using headless Chrome (Selenium ChromeDriver) installed on Ubuntu on EC2. For a small number of requests it works fine, but when a large number of simultaneous requests (hundreds) are fired, the browsers keep crashing and I always have to restart the server.

Has anyone used this setup to support a large load? I am using a t2.medium EC2 instance.

In the logs, I see "Error communicating with the remote browser":

2018-12-20 19:18:04.565 ERROR 1292 --- [io-8080-exec-79] o.s.boot.context.web.ErrorPageFilter     : Forwarding to error page from request [/v1.0/search] due to exception [Error communicating with the remote browser. It may have died.
Build info: version: '3.11.0', revision: 'e59cfb3', time: '2018-03-11T20:26:55.152Z'
System info: host: 'ip-172-31-17-81', ip: '172.31.17.81', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-1072-aws', java.version: '1.8.0_191'
Driver info: driver.version: RemoteWebDriver
Capabilities {acceptInsecureCerts: false, acceptSslCerts: false, applicationCacheEnabled: false, browserConnectionEnabled: false, browserName: chrome, chrome: {chromedriverVersion: 2.37.544315 (730aa6a5fdba15..., userDataDir: /tmp/.org.chromium.Chromium...}, cssSelectorsEnabled: true, databaseEnabled: false, handlesAlerts: true, hasTouchScreen: false, javascriptEnabled: true, locationContextEnabled: true, mobileEmulationEnabled: false, nativeEvents: true, networkConnectionEnabled: false, pageLoadStrategy: normal, platform: LINUX, platformName: LINUX, rotatable: false, setWindowRect: true, takesHeapSnapshot: true, takesScreenshot: true, unexpectedAlertBehaviour: , unhandledPromptBehavior: , version: 69.0.3497.100, webStorageEnabled: true}
Session ID: e47c3c443164cbd2a3586ee6321d26f8]

org.openqa.selenium.remote.UnreachableBrowserException: Error communicating with the remote browser. It may have died.
Build info: version: '3.11.0', revision: 'e59cfb3', time: '2018-03-11T20:26:55.152Z'
System info: host: 'ip-172-31-17-81', ip: '172.31.17.81', os.name: 'Linux', os.arch: 'amd64', os.version: '4.4.0-1072-aws', java.version: '1.8.0_191'
Driver info: driver.version: RemoteWebDriver
Capabilities {acceptInsecureCerts: false, acceptSslCerts: false, applicationCacheEnabled: false, browserConnectionEnabled: false, browserName: chrome, chrome: {chromedriverVersion: 2.37.544315 (730aa6a5fdba15..., userDataDir: /tmp/.org.chromium.Chromium...}, cssSelectorsEnabled: true, databaseEnabled: false, handlesAlerts: true, hasTouchScreen: false, javascriptEnabled: true, locationContextEnabled: true, mobileEmulationEnabled: false, nativeEvents: true, networkConnectionEnabled: false, pageLoadStrategy: normal, platform: LINUX, platformName: LINUX, rotatable: false, setWindowRect: true, takesHeapSnapshot: true, takesScreenshot: true, unexpectedAlertBehaviour: , unhandledPromptBehavior: , version: 69.0.3497.100, webStorageEnabled: true}
Session ID: e47c3c443164cbd2a3586ee6321d26f8
        at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:566)
        at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:602)
        at org.openqa.selenium.remote.RemoteWebDriver.quit(RemoteWebDriver.java:445)
        at mt.service.KlookServiceImpl.searchTrips(KlookServiceImpl.java:138)
        at mt.controller.TripSearchController.getTripResults(TripSearchController.java:120)
        at sun.reflect.GeneratedMethodAccessor163.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:221)
        at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:137)
        at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:110)
        at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:776)
        at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:705)
        at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:85)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:959)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:893)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:967)
...
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at org.openqa.selenium.net.UrlChecker.waitUntilUnavailable(UrlChecker.java:145)
        at org.openqa.selenium.remote.service.DriverService.stop(DriverService.java:214)
        at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:95)
        at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:545)
        ... 58 common frames omitted
Caused by: java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:156)
        at org.openqa.selenium.net.UrlChecker.waitUntilUnavailable(UrlChecker.java:115)
        ... 61 common frames omitted

There is also a Java heap space issue:

2018-12-20 19:18:20.294 ERROR 1292 --- [io-8080-exec-84] o.s.boot.context.web.ErrorPageFilter     : Forwarding to error page from request [/v1.0/search] due to exception [Java heap space]

java.lang.OutOfMemoryError: Java heap space


Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "http-bio-8080-exec-96"

When this error occurs, I need to restart the server. Everything then goes back to normal for a small number of runs, but the problem repeats under larger loads.

user1955934

1 Answer

These error messages...

ERROR 1292 --- [io-8080-exec-79] o.s.boot.context.web.ErrorPageFilter     : Forwarding to error page from request [/v1.0/search] due to exception [Error communicating with the remote browser. It may have died.
.
org.openqa.selenium.remote.UnreachableBrowserException: Error communicating with the remote browser. It may have died.
.
2018-12-20 19:18:20.294 ERROR 1292 --- [io-8080-exec-84] o.s.boot.context.web.ErrorPageFilter     : Forwarding to error page from request [/v1.0/search] due to exception [Java heap space]
.
java.lang.OutOfMemoryError: Java heap space
.
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "http-bio-8080-exec-96"    

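...imply that a java.lang.OutOfMemoryError was raised while handling requests to /v1.0/search, after which Spring's ErrorPageFilter forwarded them to the error page. The UnreachableBrowserException, going by the stack trace, came from RemoteWebDriver.quit() timing out while waiting for the driver service to stop.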


OutOfMemoryError

An OutOfMemoryError is often a sign of a memory leak. This error is thrown when there is insufficient space to allocate an object in the Java heap. This situation occurs when the garbage collector cannot make space available to accommodate a new object, and the heap cannot be expanded further. This error may also be thrown when there is insufficient native memory to support the loading of a Java class. In a rare scenario, a java.lang.OutOfMemoryError may be thrown when an excessive amount of time is being spent doing garbage collection and little memory is being freed.


java.lang.OutOfMemoryError: Java heap space

The detail message "Java heap space" indicates that an object could not be allocated in the Java heap. This error does not necessarily imply a memory leak. The problem can be as simple as a configuration issue, where the specified heap size (or the default size, if it is not specified) is insufficient for the application.
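
To check whether the configured heap is simply too small, it helps to see what limit the running JVM actually has. The following is a minimal sketch using only java.lang.Runtime; the class and method names (HeapReport, maxHeapMiB, usedHeapMiB) are made up for illustration and are not part of the original application.

```java
// A minimal sketch: report the heap limits of the running JVM.
// HeapReport and its method names are illustrative assumptions.
final class HeapReport {

    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently occupied (allocated minus free), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.printf("max heap:  %d MiB%n", maxHeapMiB());
        System.out.printf("used heap: %d MiB%n", usedHeapMiB());
    }
}
```

Logging these two numbers periodically under load should show whether usage climbs steadily towards the maximum (which suggests a leak) or the maximum is just low for the workload.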

In some cases for a particular long-lived application, the message might be an indication that the application is unintentionally holding references to objects, and this prevents the objects from being garbage collected. This is the Java language equivalent of a memory leak.

Note: The APIs that are called by an application could also be unintentionally holding object references.
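
As an illustration of "unintentionally holding references": a long-lived map that only ever grows keeps every value reachable, so the garbage collector can never reclaim them. One possible remedy, sketched below under the assumption that old entries can safely be dropped, is to bound the map. BoundedCache is a hypothetical name; the eviction hook itself (LinkedHashMap.removeEldestEntry) is standard JDK API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a cache that evicts its least recently used entry
// once it is full, so it cannot keep every result reachable forever.
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true orders entries from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after each put(); returning true drops the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<Integer, byte[]> cache = new BoundedCache<>(100);
        for (int i = 0; i < 10_000; i++) {
            cache.put(i, new byte[1024]); // evicted entries become collectable
        }
        System.out.println("entries retained: " + cache.size());
    }
}
```

With an unbounded HashMap in place of BoundedCache, the same loop would keep all 10,000 arrays alive for as long as the map itself is referenced.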

There has been a lot of discussion about the unpredictable CPU and memory consumption of headless Chrome sessions. As per the discussion "Building headless for minimum cpu+mem usage", CPU and memory usage can be optimized as follows:

  • Using either a custom proxy or C++ ProtocolHandlers you could return stub 1x1 pixel images or even block them entirely.
  • At the time of that discussion, the Chromium team was working on adding programmatic control over when frames are produced. Headless Chrome was still trying to render at 60 fps, which is rather wasteful. Many pages do need a few frames (maybe 10-20 fps) to render properly (due to usage of requestAnimationFrame and animation triggers), but a lot of CPU savings were expected here.
  • MemoryInfra should help you determine which component is the biggest consumer of memory in your setup.
  • A sample usage:

    $ headless_shell --remote-debugging-port=9222 --trace-startup=*,disabled-by-default-memory-infra http://www.chromium.org
    
  • Chromium is always going to use as many resources as are available to it. If you want to effectively limit its utilization, you should look into using cgroups.

You can find a detailed discussion in Limit chrome headless CPU and memory usage
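
cgroups limit Chrome from the outside. On the application side, a complementary idea (not taken from the linked discussions, just a sketch) is to cap how many browser sessions are alive at once, so that hundreds of simultaneous requests wait for a slot instead of each starting its own Chrome. SessionLimiter and its methods are hypothetical names, and the Callable stands in for whatever code creates, uses and quits the driver.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: caps how many tasks (for example, browser sessions)
// may run at once. Nothing here is Selenium API; the task is any Callable.
final class SessionLimiter {

    private final Semaphore permits;

    SessionLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent, true); // fair: FIFO waiters
    }

    /** Runs the task once a slot is free; fails if none frees up in time. */
    <T> T run(Callable<T> task, long timeout, TimeUnit unit) throws Exception {
        if (!permits.tryAcquire(timeout, unit)) {
            throw new TimeoutException("no free session slot");
        }
        try {
            return task.call(); // e.g. create the driver, scrape, then quit it
        } finally {
            permits.release(); // always give the slot back, even on failure
        }
    }

    /** Number of slots currently unused. */
    int availableSlots() {
        return permits.availablePermits();
    }
}
```

A call might look like limiter.run(() -> scrape(url), 30, TimeUnit.SECONDS), where scrape is your own method. The right limit depends on how much memory each Chrome instance needs on your instance type, so treat any number as something to measure rather than assume.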


Other Approaches

Taking a leaf out of @BenChilds's answer: you always have a finite maximum amount of heap memory configured, whichever platform you are on. Java simply chooses to make the default smaller.

There are several approaches you can adopt to either determine how much memory your program needs or reduce the amount of memory it uses. One common issue in garbage-collected languages such as Java or C# is keeping references to objects that you no longer use, or allocating many objects when you could reuse them instead. As long as objects have a reference to them, they will continue to use heap space, because the garbage collector will not delete them.

In such cases you can use a Java memory profiler to determine which methods in your program are allocating a large number of objects, and then determine whether there is a way to make sure they are no longer referenced, or to not allocate them in the first place. Apart from the above-mentioned MemoryInfra, another option is JMP.

If you determine that you are allocating these objects for a reason and you need to keep around references, you will just need to increase the max heap size when you start your program. In case you can't guarantee that your program will run in some finite amount of memory you will always run into this problem. Only after exhausting all of this will you need to look into caching objects out to disk etc.

In the end you may legitimately conclude "I need X GB of memory" for some task, and that you can't work around it by improving your algorithms or memory allocation patterns. Generally this is only the case for algorithms operating on large datasets (such as a database or a scientific analysis program), and then techniques like caching and memory-mapped I/O become useful.

A simpler approach can be to run Java with command-line options that size the heap:

  • -Xms1g sets the initial heap size to 1 GB.
  • -Xmx2048m sets the maximum heap size to 2048 MB.
  • -Xmx2g sets the maximum heap size to 2 GB.
  • In general, -Xms<size> sets the initial heap size and -Xmx<size> sets the maximum heap size.

To set this option within Eclipse, you need to go to:

Run -> Run Configurations... -> Click on the (x)= Arguments tab -> within the VM arguments textbox, type -Xms1g, -Xmx2048m or -Xmx2g

[Screenshot: VM arguments]

However, increasing the heap size is not the ideal solution. As per best practices, you should also consider:

  • Using the correct object type, as an example: String, StringBuffer or StringBuilder
  • Distinguishing between static and non static variables.
  • Properly using multithreading.
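
As a small illustration of the first point, here is a sketch comparing repeated String concatenation with a StringBuilder. The class and method names are made up for this example. Both methods return the same text, but the first allocates a new String on every iteration while the second appends into one growable buffer.

```java
// Illustrative comparison; JoinExample and its methods are hypothetical names.
final class JoinExample {

    /** Repeated String concatenation: each += creates a new String object. */
    static String joinWithConcat(String[] parts) {
        String out = "";
        for (String p : parts) {
            out += p;
        }
        return out;
    }

    /** StringBuilder appends into a single growable buffer instead. */
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }
}
```

For a handful of short strings the difference is negligible; it starts to matter when large page sources or many fragments are assembled inside a loop.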
undetected Selenium