
Possible Duplicate:
Decode HTML entities in Python string?

I have a string full of HTML escape characters such as &quot;, &rdquo;, and &mdash;.

Do any Python libraries offer reliable ways for me to replace all of these escape characters with their respective actual characters?

For instance, I want every &quot; replaced with ".

dangerChihuahua007

1 Answer


You want to use this:

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape
html_decoded_string = unescape(html_encoded_string)
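A minimal sketch of what the decoding looks like on Python 3.4+, using `html.unescape()` from the standard library. The sample string is made up for illustration and simply reuses the entities mentioned in the question:

```python
import html

# Hypothetical sample input; any string containing HTML entities works.
html_encoded_string = "&quot;Hello&rdquo; &mdash; she said &amp; left"

# html.unescape replaces named and numeric character references
# with the corresponding Unicode characters.
html_decoded_string = html.unescape(html_encoded_string)
print(html_decoded_string)
```

Named entities such as `&rdquo;` and `&mdash;` come back as the matching Unicode characters, so the result is an ordinary `str` that needs no further parsing.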

I am also seeing a lot of love for BeautifulSoup:

# BeautifulSoup 3 API; bs4 dropped the convertEntities argument (see the comments)
from BeautifulSoup import BeautifulSoup
html_decoded_string = BeautifulSoup(html_encoded_string, convertEntities=BeautifulSoup.HTML_ENTITIES)

This is also a duplicate of these existing questions:

Decode HTML entities in Python string?

Decoding HTML entities with Python

Decoding HTML Entities With Python

Martin Thoma
Francis Yaconiello
  • 2
    If you know it's a duplicate, why not flag instead of answering (other than rep)? – kapa Jul 10 '12 at 07:35
  • 2
    It's annoying when people don't take the time to look for existing answers to their questions, especially in this case, when there are so many exact replicas. However, I feel the community overflags sometimes. What if we had misunderstood the question and it really wasn't a duplicate? What if my answering the question sparked a meaningful conversation/thread that takes the question and answer in a different direction? Also, it's not really about the reputation; once a question is closed or deleted, reputation related to it may be negated... – Francis Yaconiello Jul 10 '12 at 14:41
  • 1
    I only tried to warn you about the generally accepted norms of behaviour here on StackOverflow. If you seemed to care a bit, I would look up the Meta question about this, but I guess you can find it yourself if you are interested. I don't want to get into arguing about this, I was just the messenger, do it as you wish :). – kapa Jul 10 '12 at 15:06
  • 1
    With `beautifulsoup4==4.6.0` and py3, this should be `pip install beautifulsoup4` and then `from bs4 import BeautifulSoup; html_decoded_string = BeautifulSoup(x, "lxml"); print(html_decoded_string.string)` – Shadi May 07 '18 at 09:02
  • 1
    In Python 3 this should be `from html.parser import HTMLParser`. – TheInitializer May 22 '18 at 19:49
  • As an update to @TheInitializer's comment: `html.parser.HTMLParser().unescape()` is now deprecated. Use `html.unescape()` instead. – jshrimp29 Jul 30 '19 at 18:18