
I'm running the following code for a web scraper:

    import os
    import urllib
    import requests
    from lxml import html

    # save source page and return xpath tree
    def scrape_Page(url, path):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        # save html content
        file_name = url.split('/')[-1] + ".html"
        with open(os.path.join(path, file_name), 'wb') as srcFile:
            webPage = urllib.urlopen(url)
            wPageSrc = webPage.read()
            webPage.close()
            # write the raw source to the file
            srcFile.write(wPageSrc)
        return tree

The code works well for some URLs, but fails for a few others. Here's the error message I get:

tree = html.fromstring(page.text)
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 669, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 563, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2993, in lxml.etree.fromstring (src/lxml/lxml.etree.c:62433)
  File "parser.pxi", line 1584, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:91750)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

1 Answer


tl;dr: use html.fromstring(r.content).
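Applied to the function from the question, that's essentially a one-line change. Here's a sketch based on the posted code (as a bonus, you can reuse page.content for the file instead of downloading the same page a second time with urllib):

    import os
    import requests
    from lxml import html

    # save source page and return xpath tree
    def scrape_Page(url, path):
        page = requests.get(url)
        tree = html.fromstring(page.content)   # bytes, not page.text
        # save html content, reusing the bytes we already downloaded
        file_name = url.split('/')[-1] + ".html"
        with open(os.path.join(path, file_name), 'wb') as srcFile:
            srcFile.write(page.content)
        return tree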

For more detail, see the lxml documentation under Python unicode strings:

… the parsers in lxml.etree can handle unicode strings straight away … This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding … Similarly, you will get errors when you try the same with HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
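You can reproduce that error without any network access at all. Here's a minimal sketch; the markup is made up, but it has the same shape as an XHTML-style page that declares its own encoding, which is presumably what the failing URLs are serving:

    from lxml import html

    # hypothetical markup with an XML encoding declaration
    page_text = u'<?xml version="1.0" encoding="utf-8"?><html><body><p>hi</p></body></html>'

    try:
        html.fromstring(page_text)          # Unicode string + encoding declaration
    except ValueError as e:
        print(e)                            # Unicode strings with encoding declaration are not supported...

    # the same markup as bytes parses fine; lxml reads the declaration itself
    tree = html.fromstring(page_text.encode('utf-8'))
    print(tree.xpath('//p/text()'))         # ['hi']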

Meanwhile, if you look at the requests documentation, under Response Content:

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded … When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text.
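In other words, by the time you touch r.text, the decoding has already happened. You can see what requests decided for yourself (http://example.com is just a placeholder here):

    import requests

    r = requests.get('http://example.com/')

    print(r.headers.get('content-type'))    # e.g. 'text/html; charset=UTF-8'
    print(r.encoding)                       # the charset requests guessed from the headers
    print(type(r.text))                     # unicode on Python 2 (str on Python 3) -- already decoded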

So, putting those together: you should never call html.fromstring(page.text), because page.text has already been decoded to Unicode, and lxml doesn't want Unicode here. What lxml wants is the raw, undecoded bytes.

How do you get the raw, undecoded bytes out of requests? Look at the very next section of the requests docs, Binary Response Content:

You can also access the response body as bytes, for non-text requests … r.content
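So the fix is just to hand lxml the bytes and let it do its own encoding detection. A sketch, again with a placeholder URL:

    import requests
    from lxml import html

    r = requests.get('http://example.com/')

    print(type(r.content))                  # str on Python 2 / bytes on Python 3 -- raw, undecoded
    tree = html.fromstring(r.content)       # lxml works out the encoding itself
    print(tree.findtext('.//title'))        # e.g. 'Example Domain'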

If you don't understand the distinction between Unicode strings and byte strings, and what all this decoding nonsense is about, the Unicode HOWTO in the Python docs has a great explanation. But basically: network sockets (and files, and many other things) only deal in bytes, and a single byte can only take 256 different values, while there are hundreds of thousands of characters. How do you deal with that? You pick an encoding, use it to convert the Unicode text into a sequence of bytes, send that over the wire, and decode it on the other end. That means you need some way to specify which encoding you picked, so the other side can decode it. Web pages generally specify it in an HTTP header, although there are a few other ways to do it (such as a meta charset tag in the page itself).

requests tries to be smart and dig that information out for you and take care of the decoding so you don't have to think about it, which is normally very cool. Unfortunately, lxml also tries to be smart and figure out the decoding for you, and if they both try to do it, they're going to confuse each other.
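If you want to see that round trip in miniature (made-up strings, no network involved):

    text = u'caf\xe9'                       # Unicode text: "cafe" with an accented e

    data = text.encode('utf-8')             # pick an encoding, turn text into bytes
    print(repr(data))                       # 'caf\xc3\xa9' -- one character became two bytes

    print(data.decode('utf-8') == text)     # True: both sides agreed on the encoding
    print(repr(data.decode('latin-1')))     # u'caf\xc3\xa9' -- wrong guess, garbled text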

abarnert
  • Now I understand why I shouldn't use r.text. But why should I use r.content instead? Could you explain in more detail why the parser accepts the result from r.content? – c81e728d9d4c2f636f067f89cc1486 Oct 15 '14 at 17:45
  • @c81e728d9d4c2f636f067f89cc1486: OK, I'll update the answer. – abarnert Oct 15 '14 at 17:50
  • @c81e728d9d4c2f636f067f89cc1486: Actually, does the possible dup answer your question? If so, we should just close this question as a dup, you can go upvote that answer, and I don't need to try to explain it in my usual 10x as many words as Rob. :) – abarnert Oct 15 '14 at 18:02