
I am scraping some HTML content:

for i, c in enumerate(cards[75:77]):
    print(i)
    a = c.find_element_by_class_name("influencer-stagename")
    print(a.get_attribute('innerHTML'))

This works fine for all records except the 76th one. Output before the error:

0
b'<a class="influencer-analytics-link" href="/influencers/sophiewilling"><h5><span>SOPHIE WILLING</span></h5></a>'
1
b'<a class="influencer-analytics-link" href="/influencers/ferntaylorr"><h5><span>Fern Taylor.</span></h5></a>'
2
b'<a class="influencer-analytics-link" href="/influencers/officialshaniceslatter"><h5><span>Shanice Slatter</span></h5></a>'
3

Stacktrace...

> -------------------------------------------------------------------------
WebDriverException                        Traceback (most recent call last) <ipython-input-484-0a80d1af1568> in <module>
          3     #print(c.find_element_by_class_name("influencer-stagename").text)
          4     a = c.find_element_by_class_name("influencer-stagename")
    ----> 5     print(a.get_attribute('innerHTML').encode('ascii', 'ignore'))

    ~/anaconda3/envs/py3-env/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py in get_attribute(self, name)
        141                 self, name)
        142         else:
    --> 143             resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
        144             attributeValue = resp.get('value')
        145             if attributeValue is not None:

    ~/anaconda3/envs/py3-env/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py in _execute(self, command, params)
        631             params = {}
        632         params['id'] = self._id
    --> 633         return self._parent.execute(command, params)
        634 
        635     def find_element(self, by=By.ID, value=None):

    ~/anaconda3/envs/py3-env/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py in execute(self, driver_command, params)
        319         response = self.command_executor.execute(driver_command, params)
        320         if response:
    --> 321             self.error_handler.check_response(response)
        322             response['value'] = self._unwrap_value(
        323                 response.get('value', None))

    ~/anaconda3/envs/py3-env/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
        240                 alert_text = value['alert'].get('text')
        241             raise exception_class(message, screen, stacktrace, alert_text)
    --> 242         raise exception_class(message, screen, stacktrace)
        243 
        244     def _value_or_default(self, obj, key, default):

    WebDriverException: Message: unknown error: bad inspector message: {"id":110297,"result":{"result":{"type":"object","value":{"status":0,"value":"<a class=\"influencer-analytics-link\" href=\"/influencers/bookishemily\"><h5><span>Emily | 18 | GB | Student\uD83C...</span></h5></a>"}}}}   (Session info: chrome=75.0.3770.100)   (Driver info: chromedriver=2.40.565386 (45a059dc425e08165f9a10324bd1380cc13ca363),platform=Mac OS X 10.14.0 x86_64)

I suspect it is an invalid character in

value":"Emily | 18 | GB | Student\uD83C..."

Specifically I suspect "\uD83C"

Adding

.encode("utf-8")  OR   .encode('ascii', 'ignore')

to the second print statement changes nothing.

Any thoughts on how to solve this?

UPDATE: The problem is with emoji characters. I have found 3 examples so far and each has an emoji (pink flower, Russian flag and swirling leaves). If I edit them out with the Chrome inspector my code runs fine, but this is not a solution that works at scale.

undetected Selenium
Axle Max

1 Answer

This error message...

WebDriverException: Message: unknown error: bad inspector message: {"id":110297,"result":{"result":{"type":"object","value":{"status":0,"value":"<a class=\"influencer-analytics-link\" href=\"/influencers/bookishemily\"><h5><span>Emily | 18 | GB | Student\uD83C...</span></h5></a>"}}}}   (Session info: chrome=75.0.3770.100)   (Driver info: chromedriver=2.40.565386 (45a059dc425e08165f9a10324bd1380cc13ca363),platform=Mac OS X 10.14.0 x86_64)

...implies that ChromeDriver was unable to parse some invalid Unicode characters because of a JSON encoding/decoding issue.


Deep Dive

As per the discussion in Issue 723592: 'Bad inspector message' errors when running URL web-platform-tests via webdriver, John Chen (Owner - WebDriver for Google Chrome) mentioned in his comment:

A JSON encoding/decoding issue caused the "Bad inspector message" error reported at https://travis-ci.org/w3c/web-platform-tests/jobs/232845351. Part of the error message from part 1 contains an invalid Unicode character \uFDD0 (from https://github.com/w3c/web-platform-tests/blob/34435a4/url/urltestdata.json#L3596). The JSON encoder inside Chrome didn't detect such error, and passed it through in the JSON blob sent to ChromeDriver. ChromeDriver uses base/json/json_parser.cc to parse the JSON string. This parser does a more thorough error detection, notices that \uFDD0 is an invalid character, and reports an error. I think our JSON encoder and decoder should have exactly the same amount of error checking. It's problematic that the encoder can create a blob that is rejected by the decoder.


Analysis

John Chen (Owner - WebDriver for Google Chrome) further added:

The JSON encoding happens in protocol layout of DevTools, just before the result is sent back to ChromeDriver. The relevant code is in https://cs.chromium.org/chromium/src/out/Debug/gen/v8/src/inspector/protocol/Protocol.cpp. In particular, escapeStringForJSON function is responsible for encoding strings. It's actually quite conservative. Anything above 126 is encoded in \uXXXX format. (Note that Protocol.cpp is a generated file. The real source is https://cs.chromium.org/chromium/src/v8/third_party/inspector_protocol/lib/Values_cpp.template.)

The error occurs in the JSON parser used by ChromeDriver. The decoding of \uXXXX sequence happens at https://cs.chromium.org/chromium/src/base/json/json_parser.cc?l=564 and https://cs.chromium.org/chromium/src/base/json/json_parser.cc?l=670. After decoding an escape sequence, the decoder rejects anything that's not a valid Unicode character.
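To illustrate what "not a valid Unicode character" means for the value in the error message, here is a minimal Python sketch. It is not the Chromium parser: Python's own json module is lenient and accepts a lone surrogate, so the check is made by attempting a UTF-8 encode instead. The helper name is_valid_unicode_text is made up for this example.

```python
import json


def is_valid_unicode_text(text):
    """Return True if `text` has no lone surrogates (i.e. it encodes as UTF-8)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Same shape as the value in the error message: a high surrogate (\uD83C)
# followed by "..." instead of its low-surrogate partner.
truncated = json.loads('"Student\\uD83C..."')

# A complete surrogate pair, by contrast, decodes to a single valid code point.
complete = json.loads('"Student\\uD83C\\uDF38"')

print(is_valid_unicode_text(truncated))  # False
print(is_valid_unicode_text(complete))   # True
print(hex(ord(complete[-1])))            # 0x1f338
```

A stricter decoder, such as the one John Chen describes, would reject the first string at parse time rather than later.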

I noticed that there was a recent change to prevent a JSON encoder from emitting invalid Unicode code point: https://crrev.com/478900. Unfortunately it's not the JSON encoder used by the code involved in this bug, so it doesn't help us directly, but it's an indication that we're not the only ones affected by this type of issue.


Solution

This issue was addressed in chromedriver by replacing invalid UTF-16 escape sequences when decoding invalid UTF strings, since web platform tests may use ECMAScript strings that aren't necessarily valid UTF-16; see this revision / commit.
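The idea of replacing an invalid sequence rather than rejecting the whole message can be sketched in Python as follows. This only illustrates the approach and is not the chromedriver code. The helper name replace_lone_surrogates and the choice of U+FFFD as the replacement character are assumptions made for this example.

```python
import json
import re

# After json.loads, valid surrogate pairs have already been combined into one
# code point, so anything left in this range is a lone (invalid) surrogate.
LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def replace_lone_surrogates(text, replacement="\ufffd"):
    """Replace each lone surrogate in `text` (U+FFFD is an assumed default)."""
    return LONE_SURROGATE.sub(replacement, text)


raw = json.loads('"Emily | 18 | GB | Student\\uD83C..."')
cleaned = replace_lone_surrogates(raw)

# The cleaned string can now be encoded without raising UnicodeEncodeError.
encoded = cleaned.encode("utf-8")
print(ascii(cleaned))
```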

So a quick solution would be to ensure the following and re-execute your tests:


Alternative

As an alternative, you can use the GeckoDriver / Firefox combination. You can find a relevant discussion in: Chromedriver only supports characters in the BMP error while sending Emoji with ChromeDriver Chrome using Selenium Python to Tkinter's label() textbox
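A minimal sketch of the question's loop running on Firefox instead of Chrome might look like this. It assumes geckodriver is available on your PATH. The function name print_stage_names and its url and card_class parameters are placeholders for this example, since the question does not show how cards was obtained. Only the "influencer-stagename" class name comes from the question.

```python
def print_stage_names(url, card_class, name_class="influencer-stagename"):
    """Print the innerHTML of each card's stage-name element using Firefox."""
    # Imported here so that merely defining the function does not require
    # Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # assumes geckodriver is on PATH
    try:
        driver.get(url)
        # card_class is a placeholder: use whatever class identifies a card.
        cards = driver.find_elements(By.CLASS_NAME, card_class)
        for i, card in enumerate(cards):
            name = card.find_element(By.CLASS_NAME, name_class)
            print(i, name.get_attribute("innerHTML"))
    finally:
        driver.quit()
```

Call it with your own page URL and card class name. The find_element(By.CLASS_NAME, ...) form is used because it is available in both Selenium 3 and Selenium 4.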

undetected Selenium
  • Your proposed solution does not work for python-selenium, probably because the latest available version is 3.141.0. Your provided alternative of switching to Firefox with geckodriver works perfectly. It does require rewriting the chrome based python code according to https://stackoverflow.com/questions/50414007/unable-to-invoke-firefox-headless – DaReal May 29 '20 at 14:21