
I'm new to lxml and am currently working through an O'Reilly book. After importing html from lxml, calling html.parse returns the following error message:

Error reading file 'http://www.emoji-cheat-sheet.com/': failed to load external entity "http://www.emoji-cheat-sheet.com/"

Below is the code:

from lxml import html
page = html.parse('http://www.emoji-cheat-sheet.com/')

This can also be found in the book's repository:

https://github.com/jackiekazil/data-wrangling/blob/master/code/chp11-scraping/lxml_emoji_xpath.py

"html.parse"

gemme
    Possible duplicate of [lxml.html. Error reading file; Failed to load external entity](https://stackoverflow.com/questions/29421666/lxml-html-error-reading-file-failed-to-load-external-entity) – Max Smolens May 06 '18 at 14:50

1 Answer


The problem is that since the book was published, emoji-cheat-sheet.com has moved to https://www.webpagefx.com/tools/emoji-cheat-sheet/, so the old URL now redirects there. A plain html.parse cannot follow that redirect, because the target is served over https and the libxml2 HTTP client that lxml relies on for URLs does not support https. Most professional websites use https these days, so this comes up often.

I was able to parse it using the requests library:

import requests

# requests follows the redirect and handles https for us
page = requests.get('https://www.webpagefx.com/tools/emoji-cheat-sheet')
content = page.content  # raw bytes of the response body
print(content)
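
To get back to the XPath work from the book, the downloaded bytes can be handed to lxml instead of asking lxml to open the URL itself. The following is only a sketch of that idea: the helper names `parse_html_bytes` and `fetch_tree`, the `timeout` value, and the sample markup are all made up for illustration and are not from the book or the site.

```python
import requests
from lxml import html


def parse_html_bytes(raw):
    """Build an lxml element tree from HTML bytes that were already downloaded."""
    return html.fromstring(raw)


def fetch_tree(url, timeout=10):
    """Download url with requests (redirects and https are handled there), then parse it."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return parse_html_bytes(response.content)


# Offline demonstration on a made-up snippet, so no network access is needed.
sample = (
    b"<html><body><ul>"
    b"<li class='emoji'>smile</li><li class='emoji'>wink</li>"
    b"</ul></body></html>"
)
tree = parse_html_bytes(sample)
names = tree.xpath("//li[@class='emoji']/text()")
print(names)
```

For the real page, something like `tree = fetch_tree('https://www.webpagefx.com/tools/emoji-cheat-sheet')` should give a tree you can run `xpath` on, although the XPath expressions from the book may need adjusting if the page layout has changed.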

If you make an unsecured http request to that particular website, the server redirects you to the https page anyway. lxml on its own cannot fetch https pages like that, so the page has to be downloaded by something else first.

http://dictionary.com doesn't automatically redirect you to the https site, and the same code from the question works fine there. (I tried your emoji site too and it didn't work.)
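
If you want to check in advance whether a URL behaves like the emoji site or like dictionary.com, requests records the redirects it followed in `response.history`. Here is a rough sketch under that assumption; the helper names `redirect_chain`, `ends_on_https` and `inspect_url` are hypothetical and exist only for this example.

```python
import requests


def redirect_chain(response):
    """List every URL visited for a response, ending with the final one."""
    return [hop.url for hop in response.history] + [response.url]


def ends_on_https(response):
    """True when the final URL after any redirects uses the https scheme."""
    return response.url.lower().startswith("https://")


def inspect_url(url, timeout=10):
    """Fetch url (requests follows redirects by default) and report where it ended up."""
    response = requests.get(url, timeout=timeout)
    return redirect_chain(response), ends_on_https(response)
```

Calling `inspect_url('http://www.emoji-cheat-sheet.com/')` returns the list of URLs visited and a flag saying whether the last one is https. If the flag is true, html.parse on the original URL is likely to fail in the way shown in the question.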

If you have to parse that particular site, I suggest BeautifulSoup. I'll see if that works and report back.

lgjmac
  • Thanks, it was actually my oversight at looking at the url to parse. Using a different and valid URL (without a redirect) the error disappears. – gemme May 07 '18 at 10:40