I have a list of results that re.findall extracted from a web page.

The list items contain HTML character references such as & # 171; and & # 8211;.

How do I remove these substrings from the list items?

321
  • Please post more about the list and, if possible, the regex you are using. You may be able to filter these out in the regex itself, or you can post-process the list you obtained. – Vasif Dec 28 '17 at 22:49
  • Use an HTML parser rather than trying to parse HTML with a regex. – kindall Dec 28 '17 at 22:59
  • unescape from HTMLParser() does not work. It does not unescape character references like & # 171; & # 8211; etc. My list contains titles from a blog, some of which include numeric HTML entities. I need a way to process the final list and remove these substrings. – 321 Dec 28 '17 at 23:03
  • Well, those are broken; they should not have spaces. So if they aren't being parsed, they are being parsed correctly. – kindall Dec 28 '17 at 23:37
  • @kindall, these are not broken. I added the spaces because the Stack Overflow editor rendered them as characters. With help from https://stackoverflow.com/questions/3094659/editing-elements-in-a-list-in-python I managed to solve my problem. I wanted the final list of results to be processed:

        #!/usr/bin/python
        import re

        mylist = ['bla bla', 'bla ¢ bla bla €']
        print(mylist)

        tolist = []
        for num in mylist:
            # Remove colons first.
            a = re.sub(':', '', num)
            # Then remove "&...;" runs; apply this to a (not num) so the first substitution is kept.
            a = re.sub(r'&[^\s]*;', '', a)
            tolist.append(a)
        print(tolist)

    – 321 Dec 28 '17 at 23:57
  • Oh. In that case you should use BeautifulSoup 4; it will parse those for you. – kindall Dec 29 '17 at 00:07
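
For reference, the two approaches discussed in the comments (stripping the entities versus decoding them) can be sketched with the standard library alone. This is a minimal Python 3 sketch, not the asker's code: the names `ENTITY_RE`, `strip_entities`, `decode_entities`, and the sample `titles` are made up for illustration, and it assumes the list items contain well-formed references without the spaces shown in the question. The pattern is deliberately narrower than `&[^\s]*;`, which is greedy and would also delete any text between two entities that are not separated by whitespace.

```python
import html
import re

# Hypothetical pattern: matches decimal (&#171;), hexadecimal (&#x2013;)
# and named (&amp;) character references, and nothing else.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def strip_entities(items):
    """Return a new list with every character reference removed."""
    return [ENTITY_RE.sub("", item) for item in items]


def decode_entities(items):
    """Return a new list with every character reference decoded to text."""
    return [html.unescape(item) for item in items]


# Illustrative sample data, standing in for the re.findall results.
titles = ["&#171;First post&#187;", "Python &#8211; tips &amp; tricks"]

stripped = strip_entities(titles)
decoded = decode_entities(titles)

print(stripped)
print(decoded)
```

Stripping leaves the surrounding whitespace untouched, so an item may end up with doubled spaces; decoding with `html.unescape` keeps the characters the blog author intended.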

0 Answers