Python: How to remove HTML entities from list items?

Asked Dec 28 '17 at 22:47

Active Dec 28 '17 at 22:47

Viewed 17 times

I have a list of results from running re.findall on a web page.

The list items contain & # 171; & # 8211; etc.

How do I remove these substrings from the list items?

asked Dec 28 '17 at 22:47 by 321

Please post more about the list and, if possible, the regex you are using. You may be able to filter them out in the regex itself, or you can post-process the list you obtained. – Vasif Dec 28 '17 at 22:49
Use an HTML parser rather than trying to parse HTML with a regex. – kindall Dec 28 '17 at 22:59
unescape from HTMLParser() does not work: it does not unescape entities like & # 171; & # 8211; etc. My list contains titles from a blog that include numeric HTML entities. I need a way to process the final list and remove these substrings. – 321 Dec 28 '17 at 23:03
Well, those are broken; they should not have spaces. So if they aren't being parsed, they are being parsed correctly. – kindall Dec 28 '17 at 23:37
@kindall, these are not broken. I added the spaces because the Stack Overflow editor rendered the entities as characters. With the help of https://stackoverflow.com/questions/3094659/editing-elements-in-a-list-in-python I managed to solve my problem. I wanted to post-process the final list of results:

    #!/usr/bin/python
    import re

    mylist = ['bla bla', 'bla &cent; bla bla &euro;']
    print(mylist)

    tolist = []
    for item in mylist:
        # Drop colons first, then strip anything that looks like &...;
        a = re.sub(':', '', item)
        # Operate on `a`, not `item`, so the first substitution is kept.
        a = re.sub(r'&[^\s;]*;', '', a)
        tolist.append(a)
    print(tolist)

– 321 Dec 28 '17 at 23:57
Oh. In that case you should use BeautifulSoup 4; it will parse those for you. – kindall Dec 29 '17 at 00:07
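
The two approaches discussed in the comments above (decoding the entities versus stripping them) can be sketched with the standard library alone. This is a minimal illustration, not the asker's actual code: it assumes Python 3.4 or later, where `html.unescape` is available, and the `titles` list and the `ENTITY_RE` pattern are hypothetical names made up for the example.

```python
import html
import re

# Hypothetical sample data standing in for the scraped blog titles.
titles = ["&#171;Hello&#187; world", "2017 &#8211; review", "plain title"]

# Option 1: decode each entity into the character it represents.
decoded = [html.unescape(t) for t in titles]

# Option 2: remove the entities entirely.
# The pattern matches decimal (&#171;), hex (&#xAB;), and named (&amp;) forms.
ENTITY_RE = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
stripped = [ENTITY_RE.sub("", t) for t in titles]

print(decoded)
print(stripped)
```

Note that stripping leaves whatever whitespace surrounded the entity, so "2017 &#8211; review" ends up with two consecutive spaces; collapsing them would be a separate step if that matters.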

0 Answers