
I am trying to collect the comments on Seeking Alpha (for example: https://seekingalpha.com/article/4243835-teslas-low-2019-capex-harm-growth-story-brand-value). Below I quote one of the comments that I collected; the code I am using is listed at the end.

The problem is that sometimes it correctly returns the apostrophe (') [like "Boeing's" in the first paragraph], but at other times it returns "â€™" [like "Americaâ€™s" in the second paragraph].

"@trentbridge Holy cow what a galactically stupid argument. From Boeing's Official Website (HINT: They do not consider themselves a TECH company)

General Information. Boeing is the world's largest aerospace company and leading manufacturer of commercial jetliners, defense, space and security systems, and service provider of aftermarket support. As Americaâ€™s biggest manufacturing exporter, the company supports airlines and U.S. and allied government customers in more than 150 countries.

..."

It would be possible to simply replace every "â€™" with "'" after collecting all the content. However, I would prefer to find a way to avoid getting the wrong characters in the first place.

Any help would be appreciated!

[comment.text for comment in driver.find_elements_by_class_name('b-c-content')]
Annie Q W

1 Answer


Your issue is that the apostrophe that's being misinterpreted is not a normal apostrophe character ' but instead the Unicode character for a right single quote: ’ (U+2019). The reason it turns into mojibake is that you're decoding the content incorrectly. It's in UTF-8 (so ’ is represented by the three bytes \xe2\x80\x99), but you're decoding it with Codepage 1252 (where the three bytes \xe2\x80\x99 represent three separate characters: â, €, and ™).
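To see the mechanism concretely, here is a minimal round trip showing how the same three UTF-8 bytes come out right or wrong depending on which codec is used to decode them:

```python
# Encode the correct text to its UTF-8 bytes.
data = "America’s".encode("utf-8")   # b'America\xe2\x80\x99s'

# Decoding with the right codec recovers the text; decoding the same
# bytes as cp1252 turns \xe2\x80\x99 into the three characters â, €, ™.
print(data.decode("utf-8"))    # America’s
print(data.decode("cp1252"))   # Americaâ€™s
```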

Since you haven't shown much code, I can't offer any suggestions on how to fix the decoding issue, but there is probably a way to request Selenium to use UTF-8 (I'm frankly surprised it's not the default). Alternatively, you might be able to get the raw bytes and decode the text yourself.
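As a sketch of the "get the raw bytes and decode them yourself" route — the `body` bytes below are a stand-in for a raw HTTP response body (an assumption; with a library like requests they would come from `response.content`):

```python
# Stand-in for raw response bytes; in practice these would come from the
# HTTP layer rather than being hard-coded (an assumption for this sketch).
body = b"As America\xe2\x80\x99s biggest manufacturing exporter"

# Decode explicitly as UTF-8 instead of letting anything guess cp1252.
text = body.decode("utf-8")
print(text)   # As America’s biggest manufacturing exporter
```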

While it would be best to avoid the mis-decoding, if you really need to fix up your strings after they've been turned to mojibake, the best approach is probably to re-encode them the same way they were mis-decoded, then decode again, correctly this time:

badtext = 'Americaâ€™s'               # mojibake produced by the mis-decode
encoded = badtext.encode('cp1252')    # back to the original UTF-8 bytes
goodtext = encoded.decode('utf-8')    # 'America’s'
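Note that the re-encode step can itself raise if a string is already clean, or once decoded contains characters outside cp1252, so in practice you may want to wrap the round trip and fall back to the original string. A defensive sketch (the helper name `fix_mojibake` is my own):

```python
def fix_mojibake(s: str) -> str:
    """Undo UTF-8-decoded-as-cp1252 mangling; return s unchanged otherwise."""
    try:
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this particular kind -- leave it alone.
        return s

print(fix_mojibake("Americaâ€™s"))   # America’s
print(fix_mojibake("Boeing's"))      # Boeing's (already clean, left alone)
```

If you hit this pattern often, the third-party ftfy library (`ftfy.fix_text`) automates this kind of repair and handles many more encodings than just cp1252.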
Blckknght