How can I convert HTML into text without markup in Python?

Question

I need to get plain text from an HTML document while honoring <br> elements as newlines. BeautifulSoup's .text does not convert <br> into newlines. HTML2Text is quite nice, but it converts to markdown. How else could I approach this?

score 4 · Accepted Answer · edited May 23 '17 at 11:58

I like to use the following method. To honor line breaks, do a manual .replace('<br>', '\r\n') on the string before passing it to strip_tags(html). Note that a plain replace only catches the exact spelling <br>, not variants such as <br/> or <BR>.

From this question:

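# Python 3 import; on Python 2 use: from HTMLParser import HTMLParser
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialize the base parser (required on Python 3)
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        # Called for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()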
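
If you would rather not do the string replace by hand, the <br> handling can probably be folded into the parser itself. Below is a minimal sketch of that idea, assuming Python 3. The names BrAwareStripper and html_to_text are made up for illustration and are not part of the answer above or of any library.

```python
from html.parser import HTMLParser


class BrAwareStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes and turns <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> lands here as "br".
        # Self-closing forms such as <br/> are routed here as well by the
        # default handle_startendtag implementation.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags; entities like &amp; arrive already decoded.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def html_to_text(html):
    parser = BrAwareStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


if __name__ == "__main__":
    print(html_to_text("<p>Hello<br>world &amp; <b>friends</b><br/>bye</p>"))
```

This only treats <br> as a line break. Block-level tags such as <p> or <div> are dropped without adding a newline, so extend handle_starttag if you need those too.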

score 0 · Answer 2 · answered Jun 09 '13 at 16:40

You can strip out tags and replace them with spaces (if needed):

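import re

# Turn <br>, <br/>, <br /> and </br> (in any letter case) into newlines first
myString = re.sub(r"</?br\s*/?>", "\n", myString, flags=re.IGNORECASE)
# Then replace every remaining tag with a space
myString = re.sub(r"<[^>]*>", " ", myString)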
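
For reuse, the two substitutions can be wrapped in a function. This is only a rough sketch, assuming Python 3 and reasonably simple markup; strip_markup, BR_TAG and ANY_TAG are names invented here. It also decodes entities with html.unescape and tidies the spaces left behind by removed tags, which the two lines above do not do.

```python
import html
import re

# <br>, <br/>, <br /> and </br>, in any letter case
BR_TAG = re.compile(r"</?br\s*/?>", re.IGNORECASE)
# Anything else that looks like a tag
ANY_TAG = re.compile(r"<[^>]*>")


def strip_markup(source):
    """Hypothetical helper: regex-based HTML-to-text, <br> becomes a newline."""
    text = BR_TAG.sub("\n", source)
    text = ANY_TAG.sub(" ", text)
    # Decode entities such as &amp; after the tags are gone
    text = html.unescape(text)
    # Collapse runs of spaces and tabs, but leave newlines alone
    text = re.sub(r"[ \t]+", " ", text)
    # Trim the padding that removed tags leave at the ends of lines
    return "\n".join(line.strip() for line in text.split("\n"))


if __name__ == "__main__":
    print(strip_markup("<p>Hello<br />world &amp; <b>friends</b></p>"))
```

Keep in mind that a regex does not really parse HTML: comments, script or style contents, and attribute values containing ">" can all trip it up, so this is best kept for small, well-behaved snippets.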

mishik

