Let's say I have an HTML document with <p> and <br> tags inside. Afterwards, I'm going to strip the HTML to clean up the tags. How can I turn them into line breaks?
I'm using Python's BeautifulSoup library, if that helps at all.
Without some specifics, it's hard to be sure this does exactly what you want, but this should give you the idea. It assumes your br tags are wrapped inside p elements.
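# BeautifulSoup 3 import; with BeautifulSoup 4 this is "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup
import six

def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, six.string_types):
            # Text node: keep its content without surrounding whitespace
            text += elem.strip()
        elif elem.name == 'br':
            # Each <br> becomes a line break
            text += '\n'
    return text
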
page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""
soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line
Running this results in...
(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt
Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$
get_text seems to do what you need:
>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
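One caveat with get_text(separator="\n"): the separator is placed between every text node, so inline tags such as <b> also produce line breaks. Here is a minimal sketch of an alternative, assuming bs4 with the built-in "html.parser": swap each <br> for a newline first, then read each paragraph with a plain get_text(). The function name html_to_text and the sample document are made up for illustration.

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Turn <br> tags and paragraph boundaries into newlines (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Replace every <br> tag with a literal newline text node
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # Read each paragraph without a separator so inline tags do not split the text
    paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
    return "\n".join(paragraphs)

doc = "<p>Hello <b>world</b>!<br>Second line.</p><p>Another paragraph.</p>"
text = html_to_text(doc)
print(text)
```

This only looks at <p> elements, so text outside paragraphs is dropped; adjust the find_all call if your markup differs.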
This is a Python 3 version of @Mike Pennington's answer (it really helps); I did a little refactoring.
def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text

def get_plain_text(soup):
    plain_text = ''
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        plain_text += line
    return plain_text

To use this, just pass the BeautifulSoup object to the get_plain_text function.
soup = BeautifulSoup(page)
plain_text = get_plain_text(soup)
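Note that get_plain_text appends paragraphs back to back, so two <p> elements run together unless the first one happens to end with a <br>. If you want each paragraph on its own line, a variant along these lines might work. It is only a sketch, assuming bs4 with "html.parser"; the names paragraph_text and get_plain_text_joined are hypothetical.

```python
from bs4 import BeautifulSoup

def paragraph_text(p):
    """Collect the text of one <p>, turning each <br> into a newline."""
    parts = []
    for node in p.descendants:
        if isinstance(node, str):
            # Text node: drop surrounding whitespace, as in the answer above
            parts.append(node.strip())
        elif node.name == "br":
            parts.append("\n")
    return "".join(parts)

def get_plain_text_joined(soup):
    """Join paragraphs with a newline instead of concatenating them directly."""
    body = soup.find("body")
    # rstrip avoids a blank line when a paragraph ends with <br>
    return "\n".join(paragraph_text(p).rstrip("\n") for p in body.find_all("p"))

html = "<html><body><p>first<br>line</p><p>second<br></p><p>third</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
text = get_plain_text_joined(soup)
print(text)
```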
I use the following small library to accomplish this:
https://github.com/TeamHG-Memex/html-text
pip install html-text
As simple as:
>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'
I'm not fully sure what you're trying to accomplish, but if you're just trying to remove the HTML elements, I would use a program like Notepad2 and its Replace All function; I think you can also insert a new line using Replace All. If you replace the <p> element, make sure you also remove the closing tag (</p>). Additionally, just an FYI: <br /> is the XHTML-style form of <br>, though HTML5 accepts both, so it doesn't really matter. Python wouldn't be my first choice for this, so it's a little out of my area of knowledge; sorry I couldn't help more.
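For what it's worth, the same "replace all" idea can be done in Python with the standard library alone. This is a rough sketch, not a robust HTML parser: regular expressions will trip over comments, scripts, or attributes containing ">". The function name tags_to_newlines is made up for illustration.

```python
import re
from html import unescape

def tags_to_newlines(html):
    """Replace <br> and </p> with newlines, then drop all remaining tags."""
    # <br>, <br/> and <br /> (any letter case) become a newline
    text = re.sub(r"(?i)<br\s*/?>", "\n", html)
    # A closing </p> also ends the line
    text = re.sub(r"(?i)</p\s*>", "\n", text)
    # Remove whatever tags are left
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and trim leading/trailing whitespace
    return unescape(text).strip()

sample = "<p>Now is the<br>time</p><p>for all good men<br /></p>"
result = tags_to_newlines(sample)
print(result)
```

For anything beyond simple, well-formed snippets, a real parser such as BeautifulSoup is the safer choice.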
`... "\n", myString)` I guess. Do you only want a newline _after_ the closing tag? – Joel Cornett May 08 '12 at 01:41