How to use Python to decode

[H&agrave;i kịch] Vợ ơi l&agrave; vợ - V&acirc;n Sơn Bảo Li&ecirc;m & L&ecirc; Huỳnh

into this

[Hài kịch] Vợ ơi là vợ - Vân Sơn Bảo Liêm & Lê Huỳnh

Thanks.


I have tried the following code from the thread suggested above:

import re, HTMLParser
title="[H&agrave;i kịch] Vợ ơi l&agrave; vợ - V&acirc;n Sơn Bảo Li&ecirc;m & L&ecirc; Huỳnh"
list_of_html = re.findall("&.+?;", title) 
for e in list_of_html:
    h = HTMLParser.HTMLParser()
    unescaped = h.unescape(e)
    title = title.replace(e, unescaped)
print title

but got an error message:

Unsupported characters in input 

because I have these words in the title "kịch Vợ ơi vợ - Sơn Bảo Huỳnh". How can I correct it?

Srini
Son
  • Possible duplicate of [Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi](https://stackoverflow.com/questions/21342549/unescaping-html-with-special-characters-in-python-2-7-3-raspberry-pi) – ash Aug 27 '18 at 18:05
  • Look at what your regex captures: it finds `"& L&ecirc;"` – even if non-greedy – and you have an unescaped ampersand, which the HTML parser doesn't like. Change the regex to this: `r"&[A-Za-z]+;"`. – lenz Aug 29 '18 at 10:48
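
A minimal sketch of the regex change suggested in the comment above. It assumes Python 3, where `html.unescape` takes the place of `HTMLParser().unescape`, and it assumes the title mixes named entities such as `&agrave;` with raw Vietnamese letters and one bare ampersand. The sample string and the helper name `decode_entities` are illustrative, not taken from the linked thread:

```python
import html
import re

# Assumed sample input: named entities, raw Vietnamese letters, and a bare "&".
title = ("[H&agrave;i kịch] Vợ ơi l&agrave; vợ - "
         "V&acirc;n Sơn Bảo Li&ecirc;m & L&ecirc; Huỳnh")

# Match only well-formed named entities such as "&agrave;".
# The bare "&" is followed by a space, so it is never captured.
ENTITY = re.compile(r"&[A-Za-z]+;")


def decode_entities(text):
    """Replace each named entity in text with the character it stands for."""
    return ENTITY.sub(lambda m: html.unescape(m.group(0)), text)


decoded = decode_entities(title)
print(decoded)
```

This pattern only covers letter-only named entities; numeric references such as `&#7883;` would need a wider pattern. On Python 3, calling `html.unescape(title)` on the whole string should also work, since it leaves a bare ampersand untouched.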

0 Answers