
I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it.

It all worked fine until the machine was recently changed to Python 3.

I tested the same .html file on another machine with Python 2; it works there and returns the page contents.

soup = BeautifulSoup(open('page.html'), "lxml")

The machine with Python 3 doesn't work, and it says:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence

I searched around and tried the following, but none of them worked (using 'r' or 'rb' doesn't make a big difference):

soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')

How can I use Python 3 to parse this html page?

Thank you.

Mark K
  • Sounds like the HTML is probably declaring the wrong encoding. I don't know how you'd override that, though. – user2357112 Oct 09 '19 at 08:38
  • When you say `open('page.html', 'r')`, then Python reads the document as plain-text and tries to decode it with some locale-dependent default, which is apparently GBK in your case. `lxml` should be fine with a binary stream however, so you should try opening it with `open('page.html', 'rb')`. Or you specify the correct encoding with the `encoding=` parameter. Note: depending on how the page was saved, the encoding declaration in the document may or may not be correct. – lenz Oct 09 '19 at 08:48
  • @lenz, it says "TypeError: 'from_encoding' is an invalid keyword argument for open()" – Mark K Oct 09 '19 at 08:57
  • The parameter is called `encoding`, not `from_encoding`. – lenz Oct 09 '19 at 08:59
  • @lenz, it says "ValueError: binary mode doesn't take an encoding argument". – Mark K Oct 09 '19 at 09:06
  • If you open with `rb` you can't pass an encoding. The encoding is used to decode the binary string into a unicode string, which only happens if you open in text mode. – GPhilo Oct 09 '19 at 09:07
  • It's either or. Either you use binary mode (`'rb'`) and let the HTML parser deal with decoding, or you open a text stream with `open('page.html', 'r', encoding=...)`. – lenz Oct 09 '19 at 09:07
  • @lenz indeed, I amended the comment :) – GPhilo Oct 09 '19 at 09:12

2 Answers

It worked all fine until lately it's changed to Python 3.

In Python 3, strings are Unicode text, so when you open a file in text mode Python tries to decode its bytes, using a locale-dependent default encoding unless you specify one. Python 2, on the other hand, uses bytestrings and just returns the content of the file as-is. Try opening page.html in binary mode (open('page.html', 'rb')) and see if that works for you.
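As a minimal sketch of that difference, using only the standard library: the sample bytes and the temporary file are hypothetical stand-ins for page.html, and GBK is passed explicitly to mimic the locale default reported in the question.

```python
import os
import tempfile

# Hypothetical sample: byte 0x92 followed by a space is not a valid GBK sequence.
raw = b"<html><body><p>kids\x92 toys</p></body></html>"

with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode: Python decodes while reading. GBK is forced here to mimic
    # the locale-dependent default seen in the question.
    try:
        with open(path, "r", encoding="gbk") as fh:
            fh.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: no decoding happens, so the bytes come back unchanged.
    with open(path, "rb") as fh:
        data = fh.read()
finally:
    os.remove(path)

print(text_mode_failed)
print(data == raw)
```

The bytes read in binary mode could then be handed to the parser, for example `BeautifulSoup(data, "lxml")`, which leaves encoding detection to BeautifulSoup instead of to `open()`.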

GPhilo
  • Thanks for the reply. It gives one more warning, which says: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. – Mark K Oct 09 '19 at 08:51
  • That's a warning from BeautifulSoup, see here for how to get rid of it: https://stackoverflow.com/questions/33511544/how-to-get-rid-of-beautifulsoup-user-warning – GPhilo Oct 09 '19 at 08:53
  • it's an additional warning message. The problem is still there. – Mark K Oct 09 '19 at 09:04
  • @MarkK are you saying you opened the document in binary mode (`open(..., 'rb')`), and you still get a `UnicodeDecodeError`? – lenz Oct 09 '19 at 09:06
  • @GPhilo, it seems the problem wasn't in the BeautifulSoup part. I posted some changes, which helped solve the problem. – Mark K Oct 10 '19 at 09:35

I made 2 changes and am not sure which one (or both) took effect.

The computer was formatted and reinstalled so some settings are different.

1. In the language settings:

Administrative language settings > Change system locale >

Tick the box

Beta: Use Unicode UTF-8 for worldwide language support

2. In the code. For example, this is the original line:

print (soup.find_all('span', attrs={'class': 'listing-row__price'})[0].text.strip().encode("utf-8"))

When the ".encode("utf-8")" part was removed, it worked.
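For reference, here is a small sketch of what that call changes in Python 3. The price string is a hypothetical stand-in for the scraped text; `.encode("utf-8")` turns a `str` into `bytes`, and `print` then shows the bytes representation instead of the text.

```python
# Hypothetical stand-in for the text extracted from the page.
price = "$12,500"

# .encode() converts the str into a bytes object.
encoded = price.encode("utf-8")

print(price)    # prints the text itself
print(encoded)  # prints the bytes repr, with a b'...' wrapper
```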

  • Update on 16 Oct. 2019: the change above works, but when the box below is ticked, fonts and text in foreign-language software don't display properly.

    Beta: Use Unicode UTF-8 for worldwide language support

When the box is unticked, fonts and text in foreign-language software display well, but the problem in the question remains.

Solution with the box unticked (both the foreign-language software and the Python code work):

soup = BeautifulSoup(open(pages, 'r', encoding = 'utf-8', errors='ignore'), "lxml")
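To illustrate what `errors='ignore'` does to a byte like the 0x92 from the error message, here is a minimal sketch on a hypothetical snippet. The cp1252 line is only a guess at where such a byte might come from (in Windows-1252, 0x92 is a right single quotation mark); whether the actual page uses that encoding is an assumption, not something the question confirms.

```python
# Hypothetical snippet containing the byte from the error message.
raw = b"It\x92s here"

# errors="ignore" silently drops bytes that are not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible U+FFFD marker where decoding failed.
replaced = raw.decode("utf-8", errors="replace")

# Assumption: if the byte came from Windows-1252 text, decoding with
# cp1252 recovers the character instead of discarding it.
as_cp1252 = raw.decode("cp1252")

print(ignored)
print(replaced)
print(as_cp1252)
```

With `errors='ignore'` the offending characters are lost without any trace, so `errors='replace'` may be preferable when it matters where the data was damaged.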
Mark K
  • The second is the one that "solves" your problem, by simply printing the raw byte string instead of trying to encode it as UTF-8. You still have invalid unicode characters in your text, but if that's not important for your usage, ignoring them is a good option ;) – GPhilo Oct 10 '19 at 09:37
  • @GPhilo, however, it seems not. When the box "Beta: Use Unicode UTF-8 for worldwide language support" is unticked, the problem pops up again (even with the ".encode("utf-8")" part removed, it doesn't work). – Mark K Oct 11 '19 at 06:43