16

Let's say I have an HTML document with <p> and <br> tags inside. Afterwards, I'm going to strip the HTML to clean up the tags. How can I turn them into line breaks?

I'm using Python's BeautifulSoup library, if that helps at all.

Danica
TIMEX

5 Answers

14

Without some specifics, it's hard to be sure this does exactly what you want, but this should give you the idea. It assumes your br tags are wrapped inside p elements.

from BeautifulSoup import BeautifulSoup
import six

def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, six.string_types):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text

page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""

soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line

Running this results in...

(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt

Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$
Mike Pennington
11

get_text seems to do what you need:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
naoko
    Not really: get_text(separator='\n') inserts `separator` after *all* tags. So, for instance "This is some text without linebreaks" becomes "This is some text\nwithout\nlinebreaks". Yes, it's weird... – rbp Jul 27 '17 at 06:28
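To illustrate the comment above, here is a minimal sketch, assuming bs4 is installed. The sample markup is hypothetical (the tags in the comment's own example did not survive); it only shows that get_text joins every text node with the separator, so an inline tag such as <b> also produces breaks.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one paragraph with an inline <b> tag and no <br>.
html = "<p>This is some text <b>without</b> linebreaks</p>"
soup = BeautifulSoup(html, "html.parser")

# get_text joins every text node with the separator, so the inline
# <b> tag splits the sentence into three pieces.
pieces = soup.get_text(separator="\n").split("\n")
print(pieces)
```

The sentence comes back in three pieces even though the markup contains no <br> at all.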
5

This is a Python 3 version of @Mike Pennington's answer (it really helps). I did a little refactoring.

def replace_with_newlines(element):
    # Collect the text nodes under `element`, emitting '\n' for each <br>.
    text = ''
    for elem in element.descendants:
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text


def get_plain_text(soup):
    # Convert every <p> inside <body>, one paragraph per line
    # (this mirrors the per-paragraph print in the answer being ported).
    body = soup.find("body")
    paragraphs = [replace_with_newlines(p) for p in body.find_all('p')]
    return '\n'.join(paragraphs)

To use this, just pass the BeautifulSoup object to the get_plain_text function.

from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "html.parser")
plain_text = get_plain_text(soup)
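As an alternative to walking the tree by hand, here is a minimal sketch, assuming bs4 is installed: replace each <br> with a newline text node, then call get_text() on each <p>. The function name paragraphs_to_text and the sample string are made up for illustration.

```python
from bs4 import BeautifulSoup

def paragraphs_to_text(html):
    """Return the text of each <p>, with <br> turned into newlines."""
    soup = BeautifulSoup(html, "html.parser")
    # Swap every <br> for a literal newline text node.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # Join the paragraphs with a newline; no whitespace is stripped here.
    return "\n".join(p.get_text() for p in soup.find_all("p"))

sample = "<p>Now is the<br>time</p><p>for all<br>good men</p>"
print(paragraphs_to_text(sample))
```

Unlike the version that strips each string, this keeps whatever whitespace the source already had, so a newline that follows a <br> in the markup would show up as a blank line.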
Geng Jiawen
1

I use the following small library to accomplish this:

https://github.com/TeamHG-Memex/html-text

pip install html-text

As simple as:

>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'
Jean Monet
-6

I'm not fully sure what you're trying to accomplish, but if you're just trying to remove the HTML elements, I would use a program like Notepad2 and its Replace All function. I think you can also insert a new line using Replace All. If you replace the <p> element, make sure you also remove the closing tag (</p>). Also, just an FYI: XHTML requires <br /> while HTML5 accepts both <br> and <br />, so it doesn't really matter here. Python wouldn't be my first choice for this, so it's a little out of my area of knowledge; sorry I couldn't help more.
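For what it's worth, the same Replace All idea can be sketched in Python with the standard re module. This is only a rough illustration: the function name tags_to_newlines and the sample string are made up, and regular expressions are not a real HTML parser, so entities, comments, and attributes containing ">" are not handled.

```python
import re

def tags_to_newlines(html):
    """Rough 'Replace All' in code: regexes, not a real HTML parser."""
    # <br>, <br/> and <br /> each become a newline.
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # A closing </p> ends a paragraph, so it becomes a newline too.
    text = re.sub(r"</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Drop every remaining tag, including the opening <p>.
    text = re.sub(r"<[^>]+>", "", text)
    return text.strip()

sample = "<p>Now is the<br>time</p><p>for all<br />good men</p>"
print(tags_to_newlines(sample))
```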

Jim