
I am crawling a website with very large pages, around 100 MB in size.

driver setting:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--ignore-certificate-errors")
driver = webdriver.Chrome(executable_path="chromedriver", chrome_options=chrome_options)

The following code

html = driver.page_source

results in error:

selenium WebDriverException: Message: unknown error: bad inspector message
(Session info: headless chrome=66.0.3359.181)
(Driver info: chromedriver=2.38.552518 (183d19265345f54ce39cbb94cf81ba5f15905011),platform=Mac OS X 10.11.6 x86_64)

This cannot be an out-of-memory problem on my laptop.

undetected Selenium
Hello lad
  • Have you seen [this](https://bugs.chromium.org/p/chromedriver/issues/detail?id=1860)? Have you tried other browsers? – JeffC May 21 '18 at 18:21
  • Do you mean the website has pages totaling 100 MB in size, or one webpage with a size of 100 MB? Could you provide an example to help reproduce the error? – Claire May 21 '18 at 18:45

1 Answer


This error message...

selenium WebDriverException: Message: unknown error: bad inspector message

...implies that ChromeDriver was unable to parse some characters that are not valid Unicode, because of a JSON encoding/decoding issue, while executing this line of code:

html = driver.page_source

Analysis

With reference to the comment by John Chen (Owner - WebDriver for Google Chrome) in the discussion Issue 1860: "WebDriverException: Message: unknown error: bad inspector message:" when attempting to get page_source, the page source of the website in your use case possibly contains the Unicode code point U+FFFF, which is an invalid character. Chrome encodes it as \uFFFF before sending it to ChromeDriver, but ChromeDriver then rejects it as invalid while decoding.

John Chen (Owner - WebDriver for Google Chrome) further added:

The JSON encoding happens in protocol layout of DevTools, just before the result is sent back to ChromeDriver. The relevant code is in https://cs.chromium.org/chromium/src/out/Debug/gen/v8/src/inspector/protocol/Protocol.cpp. In particular, escapeStringForJSON function is responsible for encoding strings. It's actually quite conservative. Anything above 126 is encoded in \uXXXX format. (Note that Protocol.cpp is a generated file. The real source is https://cs.chromium.org/chromium/src/v8/third_party/inspector_protocol/lib/Values_cpp.template.)

The error occurs in the JSON parser used by ChromeDriver. The decoding of \uXXXX sequence happens at https://cs.chromium.org/chromium/src/base/json/json_parser.cc?l=564 and https://cs.chromium.org/chromium/src/base/json/json_parser.cc?l=670. After decoding an escape sequence, the decoder rejects anything that's not a valid Unicode character.

I noticed that there was a recent change to prevent a JSON encoder from emitting invalid Unicode code point: https://crrev.com/478900. Unfortunately it's not the JSON encoder used by the code involved in this bug, so it doesn't help us directly, but it's an indication that we're not the only ones affected by this type of issue.
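
The mechanism described in the quoted comment can be illustrated with a minimal Python sketch. It is not the Chromium code: the function names `escape_string_for_json`, `is_valid_character` and `decode_strict` are hypothetical, and the validity rule (no surrogates, no noncharacters such as U+FFFF) is an assumption modeled on the behaviour described above. The encoder escapes everything above 126 as \uXXXX, and the strict decoder rejects an escape that decodes to an invalid code point.

```python
def escape_string_for_json(text):
    # Conservative encoder: anything outside printable ASCII (32..126)
    # is written as a \uXXXX escape, using UTF-16 code units.
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in '"\\':
            out.append("\\" + ch)
        elif 32 <= cp <= 126:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            # Code points above the BMP become a surrogate pair.
            cp -= 0x10000
            out.append("\\u%04X\\u%04X" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)))
    return "".join(out)


def is_valid_character(cp):
    # Assumed rule: reject surrogates, U+FDD0..U+FDEF, and any
    # code point ending in FFFE or FFFF (for example U+FFFF).
    return (cp < 0xD800
            or 0xE000 <= cp < 0xFDD0
            or (0xFDEF < cp <= 0x10FFFF and (cp & 0xFFFE) != 0xFFFE))


def decode_strict(escaped):
    # Strict decoder for strings produced by escape_string_for_json:
    # raises ValueError when an escape decodes to an invalid code point.
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] != "\\":
            out.append(escaped[i])
            i += 1
            continue
        nxt = escaped[i + 1]
        if nxt != "u":
            out.append(nxt)  # escaped quote or backslash
            i += 2
            continue
        cp = int(escaped[i + 2:i + 6], 16)
        i += 6
        # Combine a high surrogate with a following low surrogate.
        if 0xD800 <= cp <= 0xDBFF and escaped[i:i + 2] == "\\u":
            low = int(escaped[i + 2:i + 6], 16)
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 6
        if not is_valid_character(cp):
            raise ValueError("invalid code point U+%04X" % cp)
        out.append(chr(cp))
    return "".join(out)


if __name__ == "__main__":
    page_source = "<p>ok</p>\uffff"  # made-up page content with U+FFFF
    wire = escape_string_for_json(page_source)
    try:
        decode_strict(wire)
    except ValueError as exc:
        print("decoder rejected the message:", exc)
```

In this sketch the encoder happily emits the escape for U+FFFF, and only the decoder refuses it, which matches the split between Chrome (encoding) and ChromeDriver (decoding) described in the comment.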


Solution

This issue was addressed through this revision / commit, which replaces invalid UTF-16 escape sequences when ChromeDriver decodes invalid UTF strings, because Web platform tests may use ECMAScript strings that are not necessarily valid UTF-16.

So a quick solution would be to ensure the following and re-execute your tests:


Alternative

As an alternative, you can use the GeckoDriver / Firefox combination. You can find a relevant discussion in "Chromedriver only supports characters in the BMP error while sending Emoji with ChromeDriver Chrome using Selenium Python to Tkinter's label() textbox".

undetected Selenium