I'm using Beautiful Soup in Python to attempt to turn some fairly junky HTML into plain text while preserving some of the formatting from HTML, specifically the line break characters.

Here's an example:

from bs4 import BeautifulSoup

html_input = '''
<body>
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a 
test message<br>
It should be ignored.
</body>
'''

# Strip the source newlines, then parse with an explicitly named parser
message_body_plain = BeautifulSoup(html_input.replace('\n', '').replace('\r', ''), 'html.parser')
print(message_body_plain.get_text())

Sometimes the HTML I've got has newlines instead of spaces (see "Full Name" above), and sometimes it doesn't. I've tried taking out all the newlines and also replacing the HTML linebreaks with newline literals, but that breaks when I come across an HTML newline written in a way I hadn't considered. Surely there's a parser that does this for me?

Here's my preferred output:

Full Name: John Doe
Phone: 01234123123
Note: This is a test message
It should be ignored.

Note how the only newlines are from the HTML tags. Does anyone know the best way to achieve what I want?

Andrew Myers
davis
  • Have a look at this post: http://stackoverflow.com/questions/13337528/rendered-html-to-plain-text-using-python I didn't flag it as a duplicate (though I should have) because `html2text` is a third-party library; it does not ship with vanilla Python. But it is a good library and does what you are looking for. – Cory Kramer Jan 13 '15 at 13:29
  • Thanks. That does indeed do exactly what I want. I have no idea why I didn't find it first. Perhaps I was too fixed on Beautiful Soup, which is apparently (mostly) a parser. Feel free to mark as a dupe! – davis Jan 13 '15 at 13:38
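
For cases where a third-party dependency is not an option, the same idea can be sketched with only the standard library's `html.parser`. This is a minimal sketch, not a replacement for `html2text`: the class name `LineBreakTextExtractor`, the function `html_to_text`, and the choice of which tags end a line are all assumptions made for illustration. Whitespace in the source (including newlines) is collapsed to single spaces, and line breaks are emitted only at `<br>` and at the end of block-level tags.

```python
import re
from html.parser import HTMLParser


class LineBreakTextExtractor(HTMLParser):
    """Hypothetical helper: collects text and breaks lines only at chosen tags."""

    # Assumption: these closing tags should end a line; adjust as needed.
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Covers both <br> and <br />, since the default handle_startendtag
        # calls handle_starttag first.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Newlines in the HTML source are just whitespace, so collapse them.
        self.parts.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        lines = (line.strip() for line in "".join(self.parts).split("\n"))
        # Empty lines are dropped, so consecutive breaks collapse into one.
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = LineBreakTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.get_text()


html_input = '''
<body>
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a 
test message<br>
It should be ignored.
</body>
'''

print(html_to_text(html_input))
```

For the sample in the question this should print the four lines listed as the preferred output. Note that it deliberately discards blank lines, so a `<br><br>` pair produces a single break.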

1 Answer

Staying within Beautiful Soup, you can also try:

soup = BeautifulSoup(html_input, "html.parser")

# Replace each listed element with its text followed by a blank line
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")

print(soup.get_text())
Masoud Rahimi