
I am trying to collect the comments on Seeking Alpha (for example: https://seekingalpha.com/article/4243835-teslas-low-2019-capex-harm-growth-story-brand-value). Below I quote one of the comments that I collected; the code I am using is listed at the end.

The problem is that sometimes it correctly returns the apostrophe (') [like "Boeing's" in the first paragraph], but at other times it returns "â€™" [like "Americaâ€™s" in the second paragraph].

"@trentbridge Holy cow what a galactically stupid argument. From Boeing's Official Website (HINT: They do not consider themselves a TECH company)

General Information. Boeing is the world's largest aerospace company and leading manufacturer of commercial jetliners, defense, space and security systems, and service provider of aftermarket support. As Americaâ€™s biggest manufacturing exporter, the company supports airlines and U.S. and allied government customers in more than 150 countries.

..."

It would be possible to simply replace every "â€™" with "'" after collecting all the content. However, I would prefer to find a way to avoid getting the wrong characters in the first place.

Any help would be appreciated!

[comment.text for comment in driver.find_elements_by_class_name('b-c-content')]
Annie Q W

1 Answer


Your issue is that the apostrophe that's being misinterpreted is not a normal apostrophe character ' but instead the Unicode character for a right single quote: ’ (U+2019). The reason it turns into mojibake is that you're decoding the content incorrectly. It's in UTF-8 (so ’ is represented by the three bytes \xe2\x80\x99), but you're decoding it with Codepage 1252 (where the three bytes \xe2\x80\x99 represent three separate characters: â, €, and ™).
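To see the mechanism concretely, here is a minimal round trip showing how the same three UTF-8 bytes come out right or wrong depending on which codec is used to decode them:

```python
# Encode the correct text to its UTF-8 bytes.
data = "America’s".encode("utf-8")   # b'America\xe2\x80\x99s'

# Decoding with the right codec recovers the text; decoding the same
# bytes as cp1252 turns \xe2\x80\x99 into the three characters â, €, ™.
print(data.decode("utf-8"))    # America’s
print(data.decode("cp1252"))   # Americaâ€™s
```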

Since you haven't shown much code, I can't offer any suggestions on how to fix the decoding issue, but there is probably a way to request Selenium to use UTF-8 (I'm frankly surprised it's not the default). Alternatively, you might be able to get the raw bytes and decode the text yourself.
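As a sketch of the "get the raw bytes and decode them yourself" route — the `body` bytes below are a stand-in for a raw HTTP response body (an assumption; with a library like requests they would come from `response.content`):

```python
# Stand-in for raw response bytes; in practice these would come from the
# HTTP layer rather than being hard-coded (an assumption for this sketch).
body = b"As America\xe2\x80\x99s biggest manufacturing exporter"

# Decode explicitly as UTF-8 instead of letting anything guess cp1252.
text = body.decode("utf-8")
print(text)   # As America’s biggest manufacturing exporter
```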

While it would be best to avoid the mis-decoding, if you really need to fix up your strings after they've been turned to mojibake, the best approach is probably to re-encode them the same way they were mis-decoded, then decode again, correctly this time:

badtext = 'Americaâ€™s'               # mojibake produced by the mis-decode
encoded = badtext.encode('cp1252')    # back to the original UTF-8 bytes
goodtext = encoded.decode('utf-8')    # 'America’s'
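Note that the re-encode step can itself raise if a string is already clean, or once decoded contains characters outside cp1252, so in practice you may want to wrap the round trip and fall back to the original string. A defensive sketch (the helper name `fix_mojibake` is my own):

```python
def fix_mojibake(s: str) -> str:
    """Undo UTF-8-decoded-as-cp1252 mangling; return s unchanged otherwise."""
    try:
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this particular kind -- leave it alone.
        return s

print(fix_mojibake("Americaâ€™s"))   # America’s
print(fix_mojibake("Boeing's"))      # Boeing's (already clean, left alone)
```

If you hit this pattern often, the third-party ftfy library (`ftfy.fix_text`) automates this kind of repair and handles many more encodings than just cp1252.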
Blckknght