Converting HTML to plain text while preserving line breaks

Question

I'm using Beautiful Soup in Python to turn some fairly junky HTML into plain text while preserving some of the HTML formatting, specifically the line breaks.

Here's an example:

from bs4 import BeautifulSoup

html_input = '''
<body>
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a 
test message<br>
It should be ignored.
</body>
'''

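# Strip source newlines before parsing; naming the parser explicitly avoids bs4's "no parser specified" warning.
message_body_plain = BeautifulSoup(html_input.replace('\n', '').replace('\r', ''), 'html.parser')
print(message_body_plain.get_text())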

Sometimes the HTML I've got has newlines instead of spaces (see "Full Name" above), and sometimes it doesn't. I've tried taking out all the newlines and also replacing the HTML linebreaks with newline literals, but that breaks when I come across an HTML newline written in a way I hadn't considered. Surely there's a parser that does this for me?

Here's my preferred output:

Full Name: John Doe
Phone: 01234123123
Note: This is a test message
It should be ignored.

Note how the only newlines are from the HTML tags. Does anyone know the best way to achieve what I want?

Have a look at this post http://stackoverflow.com/questions/13337528/rendered-html-to-plain-text-using-python. I didn't flag as duplicate (though I should) because `html2text` is a 3rd party library, it does not ship with vanilla Python. But it is a good library and does what you are looking for. — Cory Kramer, Jan 13 '15 at 13:29
Thanks. That does indeed do exactly what I want. I have no idea why I didn't find it first. Perhaps I was too fixed on Beautiful Soup, which is apparently (mostly) a parser. Feel free to mark as a dupe! — davis, Jan 13 '15 at 13:38

score 1 · Answer 1 · edited Jun 14 '19 at 07:46

Staying within Beautiful Soup, you can also try the following:

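from bs4 import BeautifulSoup

soup = BeautifulSoup(html_input, "html.parser")
# Replace each listed element with its text followed by a blank line.
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
print(soup.get_text())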
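
If you would rather not hard-code replacements per tag, the same idea can be sketched with only the standard library's `html.parser`: collapse all source whitespace (as a browser would) and emit a newline only where a tag implies one. This is a rough sketch, not a full renderer. The names `BREAK_TAGS`, `LineBreakTextExtractor`, and `html_to_text` are made up for illustration, and the set of tags treated as line boundaries is an assumption you will probably want to adjust.

```python
import re
from html.parser import HTMLParser

# Tags treated as line boundaries. This set is an assumption; extend as needed.
BREAK_TAGS = {"br", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}


class LineBreakTextExtractor(HTMLParser):
    """Collects text, turning only tag-level breaks into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing forms such as <br />.
        if tag in BREAK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # <br /> triggers both handlers; skip it here to avoid a double break.
        if tag in BREAK_TAGS and tag != "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Newlines and runs of spaces in the source are ordinary whitespace.
        self.parts.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        lines = "".join(self.parts).split("\n")
        lines = [re.sub(r" +", " ", line).strip() for line in lines]
        # Drop empty lines, so consecutive breaks collapse into one.
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = LineBreakTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return parser.get_text()


sample = "<p>Full\nName:\nJohn Doe</p>\nPhone: 01234123123<br />\nNote: This\nis a \ntest message<br>\nIt should be ignored."
print(html_to_text(sample))
```

On the sample above this prints the four lines asked for in the question. Note that the sketch drops blank lines entirely and does not skip the contents of `<script>` or `<style>` elements, so it only suits simple message bodies like the one shown.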

edited Jun 14 '19 at 07:46 by Masoud Rahimi
answered Jun 14 '19 at 07:25 by Gianluca Tarasconi
