I'm having a hard time removing the extra HTML tags from text I scraped from a web page. str.replace() in Python doesn't seem to work for targets like <br> and =, although other tags such as <li></li> are replaced successfully.

Here's my code.

# str.replace() returns a new string, so the result has to be kept
cleaned = (str(txt)
           .replace('<li>', '')
           .replace('</li>', '')
           .replace('<ol>', '')
           .replace('</ol>', '')
           .replace('<br>', '')
           .replace('=', ''))

Any advice will be much appreciated.

Robert Valencia
Yuta
  • Possible duplicate of [Strip HTML from strings in Python](http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) – Robert Valencia Apr 14 '17 at 01:28

1 Answer

You can use BeautifulSoup to get the text from the page:

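from bs4 import BeautifulSoup

# html_source is the raw HTML string of the scraped page
soup = BeautifulSoup(html_source, 'html.parser')  # name the parser explicitly
text = soup.get_text()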

BeautifulSoup parses the HTML, and its built-in get_text() method returns the text content with the tags removed.
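
If installing bs4 isn't an option, the same parse-then-collect-text idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not what BeautifulSoup does internally; the names TextExtractor, strip_tags and sample are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text nodes only; the tags themselves are never stored.
        self.parts.append(data)


def strip_tags(html_source):
    """Return the text content of html_source with all tags dropped."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Made-up sample input mixing two spellings of the line-break tag.
sample = "<ol><li>first<br>line</li><li>second<br />line</li></ol>"
print(strip_tags(sample))
```

A parser treats <br>, <br/> and <br /> as the same tag, whereas an exact-string replace('<br>', '') only matches the first spelling.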

zbw