How can I turn <br> and <p> into line breaks?

Question

Let's say I have an HTML document with <p> and <br> tags inside. Afterwards, I'm going to strip the HTML to clean up the tags. How can I turn them into line breaks?

I'm using Python's BeautifulSoup library, if that helps at all.

Any preference as to how it's done? I was going to suggest `re.sub(r"<br>|<p>", "\n", myString)` — Joel Cornett, May 08 '12 at 01:12
`</?p>|<br>` I guess. Do you only want a newline _after_ the closing tag? — Joel Cornett, May 08 '12 at 01:41
I'd skip Beautiful Soup and just shove it through XSLT instead. — Ignacio Vazquez-Abrams, May 08 '12 at 01:41
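
For reference, the regex route suggested in the comments above could look roughly like the sketch below. The sample string and the exact patterns are my own assumptions, not the commenter's code, and regexes are fragile on real-world HTML (a `>` inside an attribute value is enough to break the second substitution).

```python
import re

# Hypothetical sample input, for illustration only.
html = "<p>first line<br>second line</p><p>next paragraph</p>"

# Turn <br>, <br/>, <br /> and closing </p> tags into newlines.
with_breaks = re.sub(r"<br\s*/?>|</p\s*>", "\n", html, flags=re.IGNORECASE)

# Strip whatever tags are left (naive: assumes no '>' inside attributes).
text = re.sub(r"<[^>]+>", "", with_breaks)

print(repr(text))
```

This only puts a newline after each closing paragraph tag, which is one possible answer to the question of where the break should go.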

Mike Pennington · Answer 1 · 2021-11-13T11:50:52.170

Without more specifics, it's hard to be sure this does exactly what you want, but it should give you the idea. It assumes your <br> tags are wrapped inside <p> elements.

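# BeautifulSoup 3 import; with bs4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
import six

def replace_with_newlines(element):
    """Return the text of element, with each <br> turned into a newline."""
    text = ''
    # Walk every descendant: text nodes are strings, tags have a .name
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, six.string_types):
            # Drop the whitespace (including newlines) around each text node
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text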

page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""

soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    # Parenthesized form works on both Python 2 and Python 3
    print(line)

Running this results in...

(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt

Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$

AttributeError: module 'types' has no attribute 'StringTypes' — Sainita, Nov 13 '21 at 11:43

score 11 · Answer 2 · answered Aug 09 '16 at 22:12

get_text seems to do what you need

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'

naoko

Not really: get_text(separator='\n') inserts `separator` between *all* text nodes, not just at <br> and <p> tags. So, for instance, a sentence that renders as "This is some text without linebreaks" but contains inline tags becomes "This is some text\nwithout\nlinebreaks". Yes, it's weird... – rbp Jul 27 '17 at 06:28
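
To make that caveat concrete, here is a small sketch. The sample markup is made up for illustration, and swapping each <br> for a newline before calling get_text() is just one possible workaround, not something proposed in the answer itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one paragraph with an inline <b> tag and one <br>.
doc = "<p>This is <b>some text</b> without<br>many linebreaks</p>"

# Caveat: the separator lands between *every* text node, including around <b>.
naive = BeautifulSoup(doc, "html.parser").get_text(separator="\n")

# Workaround sketch: replace each <br> with a literal newline string,
# then join the text nodes with the default empty separator.
soup = BeautifulSoup(doc, "html.parser")
for br in soup.find_all("br"):
    br.replace_with("\n")
fixed = soup.get_text()

print(repr(naive))
print(repr(fixed))
```

With this input, `naive` has a break at every tag boundary, while `fixed` only breaks where the <br> was.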

score 5 · Answer 3 · answered Oct 18 '15 at 10:38

This is a Python 3 version of @Mike Pennington's answer (which really helped). I did a little refactoring.

def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text


def get_plain_text(soup):
    plain_text = ''
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        plain_text+=line
    return plain_text

To use this, just pass the BeautifulSoup object to the get_plain_text function.

soup = BeautifulSoup(page)
plain_text = get_plain_text(soup)
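
One thing to watch with the version above: paragraphs are concatenated with nothing between them, so they run together unless a paragraph happens to end in a <br>. Below is a self-contained sketch of the same idea using the current bs4 names (`descendants`, `find_all`) that joins paragraphs with a newline. The helper name `paragraph_text` and the shortened sample page are my own, for illustration only.

```python
from bs4 import BeautifulSoup, NavigableString

def paragraph_text(p):
    """Text of one <p>, with each <br> turned into a newline."""
    parts = []
    for node in p.descendants:
        if isinstance(node, NavigableString):
            parts.append(node.strip())   # drop whitespace around text nodes
        elif node.name == "br":
            parts.append("\n")
    return "".join(parts)

# Hypothetical sample page, shortened from the one in the first answer.
page = ("<html><body><p>America,<br>\nNow is the<br>time</p>"
        "<p>pile on debt<br></p></body></html>")

soup = BeautifulSoup(page, "html.parser")
# Join paragraphs with a newline so they don't run together.
plain_text = "\n".join(paragraph_text(p) for p in soup.find_all("p"))

print(plain_text)
```

Whether you want one newline or a blank line between paragraphs is a matter of taste; change the join string accordingly.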

score 1 · Answer 4 · answered Dec 28 '21 at 23:56

I use the following small library to accomplish this:

https://github.com/TeamHG-Memex/html-text

pip install html-text

As simple as:

>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'

Jean Monet

score -6 · Answer 5 · answered May 08 '12 at 01:42

I'm not fully sure what you're trying to accomplish, but if you're just trying to remove the HTML elements, I would use a program like Notepad2 and its Replace All function. I think you can also insert a new line using Replace All. If you replace the <p> element, make sure you also remove the closing tag (</p>). Additionally, just an FYI: the proper HTML5 is <br /> instead of <br>, but it doesn't really matter. Python wouldn't be my first choice for this, so it's a little outside my area of knowledge. Sorry I couldn't help more.
