4

I need to get plain text from an HTML document while honoring <br> elements as newlines. BeautifulSoup's .text does not turn <br> into newlines. HTML2Text is quite nice, but it converts to Markdown. How else could I approach this?

Sean W.

2 Answers

4

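I like to use the following method. To honor line breaks, do a manual .replace('<br>','\r\n') on the string before passing it to strip_tags(html). Note that this only catches the literal <br>, so variants such as <br/> or <BR> need their own replacement.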

From this question:

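# Python 3 import; in Python 2 this was `from HTMLParser import HTMLParser`
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Python 3 requires the base initializer; it also calls reset()
        super().__init__()
        self.fed = []  # text chunks found between tags
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()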
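
As an alternative to replacing <br> in the string beforehand, the parser itself can emit the newline. The following is only a sketch, assuming Python 3; the names BrAwareStripper and strip_tags_keep_br are made up for illustration and are not part of the linked answer. It relies on HTMLParser lowercasing tag names and routing self-closing tags through handle_starttag, so <br>, <BR> and <br/> are all covered.

```python
from html.parser import HTMLParser


class BrAwareStripper(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.fed = []  # text chunks and newlines, in document order

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br/> also ends up here via
        # the default handle_startendtag implementation.
        if tag == "br":
            self.fed.append("\n")

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


def strip_tags_keep_br(html):  # hypothetical helper name
    s = BrAwareStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()


print(strip_tags_keep_br("<p>line one<br>line two<br/>line <b>three</b></p>"))
```

This prints the three lines separated by newlines. Other tags contribute nothing, so adjacent block elements such as consecutive <p> tags still run together unless they are handled the same way.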
That1Guy
0

You can turn <br> tags into newlines first, then strip out the remaining tags and replace them with spaces (if needed):

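import re

# myString holds the HTML source.
# Turn <br>, <br/>, <br /> and </br> (any case) into newlines first.
myString = re.sub(r"</?br\b[^>]*>", "\n", myString, flags=re.IGNORECASE)
# Then replace every remaining tag with a space.
myString = re.sub(r"<[^>]*>", " ", myString)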
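
For reference, here is how the two substitutions might be wrapped into a function. This is a rough sketch, not a robust converter: the name html_to_text is made up, and the entity decoding and whitespace cleanup are additions that the answer does not mention. Regexes do not really parse HTML, so a `>` inside an attribute value, or the contents of <script> and <style>, will not be handled correctly.

```python
import re
from html import unescape


def html_to_text(html):  # hypothetical helper name
    # 1. <br>, <br/>, <br />, </br> in any case become newlines.
    text = re.sub(r"</?br\b[^>]*>", "\n", html, flags=re.IGNORECASE)
    # 2. Every remaining tag becomes a single space.
    text = re.sub(r"<[^>]*>", " ", text)
    # 3. Decode entities such as &amp; (an addition, not in the answer).
    text = unescape(text)
    # 4. Collapse runs of spaces and tabs, but leave newlines alone.
    text = re.sub(r"[ \t]+", " ", text)
    # 5. Trim the padding that step 2 leaves at the edges of each line.
    return "\n".join(line.strip() for line in text.split("\n"))


print(html_to_text("<p>Hello<br />World &amp; <b>all</b></p>"))
```

Replacing tags with a space rather than an empty string keeps words from neighbouring elements from fusing together; steps 4 and 5 then clean up the extra spaces this introduces.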
mishik