Convert </br> to end line

Question

I'm trying to extract some text using BeautifulSoup. I'm using the get_text() function for this purpose.

My problem is that the text contains </br> tags and I need to convert them to end lines. How can I do this?

I know this is 10 years old but did you mean `<br>` and not `</br>`? I'd simply edit it but since there is a possibility that you actually mean broken `</br>` tags (and were misunderstood by the answerers), I thought I'd ask. – CherryDT Dec 22 '22 at 08:42

score 88 · Answer 1 · answered Jan 06 '16 at 18:40

You can do this using the BeautifulSoup object itself, or any element of it:

for br in soup.find_all("br"):
    br.replace_with("\n")

answered Jan 06 '16 at 18:40

Ian Mackinnon

The benefit of this answer is that you can call `soup.text` afterwards to remove other HTML tags, whereas the currently accepted answer doesn't provide that possibility. – the Apr 11 '17 at 15:21

Watch out for this, you may end up losing some contents unintentionally. You may need to do something like `br.replace_with("\n" + br.text)`. This tag is evil... – dividebyzero Feb 21 '18 at 09:52

score 65 · Answer 2 · answered Feb 05 '18 at 17:22

As the official documentation says:

You can specify a string to be used to join the bits of text together: soup.get_text("\n")

answered Feb 05 '18 at 17:22

Guts

Seems that "bits of text" are words, not lines, so this would add a newline between each 2 words. – Sasha Apr 02 '18 at 11:56

@Sasha I'm not sure what you mean by that -- I believe "bits of text" refers to text separated by tags. I am certainly not getting a newline between each pair of words when I run it, as you suggest. – Y Davis Jul 16 '19 at 17:54
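
To check how the separator behaves, here is a small sketch (the sample HTML is invented for illustration). It suggests that the separator goes between text nodes, that is, wherever a tag splits the text, and not between words. Note that inline tags such as `<b>` produce a break as well:

```python
from bs4 import BeautifulSoup

# Invented sample: one <br> plus an inline <b> element.
html = "<p>one two<br>three <b>four</b> five</p>"
soup = BeautifulSoup(html, "html.parser")

# The separator is placed between consecutive text nodes.
joined = soup.get_text("\n")
print(repr(joined))

# strip=True trims each text node before joining.
stripped = soup.get_text("\n", strip=True)
print(repr(stripped))
```

So "one two" stays on one line, but the `<b>` boundaries introduce line breaks that a plain `<br>` replacement would not.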

score 7 · Answer 3 · answered Nov 06 '21 at 06:37

You can also use get_text(separator='\n', strip=True):

from bs4 import BeautifulSoup
bs = BeautifulSoup('<td>some text<br>some more text</td>', 'html.parser')
text = bs.get_text(separator='\n', strip=True)
print(text)

Output:
some text
some more text

It works for me.

score 5 · Accepted Answer · answered Sep 22 '12 at 17:05

A regex should do the trick.

import re
s = re.sub(r'<br\s*?>', '\n', yourTextHere)

Hope this helps!

answered Sep 22 '12 at 17:05

mbinette

This will fail to convert self-closing tags – CherryDT Dec 22 '22 at 08:39
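
One possible way to address that, sketched here under the assumption that the input is a raw HTML string: make the pattern accept an optional slash on either side and ignore case. The helper name `br_to_newline` and the sample string are hypothetical, and the pattern still does not handle `<br>` tags that carry attributes.

```python
import re

# Matches <br>, <br/>, <br />, </br> and upper-case variants.
BR_TAG = re.compile(r"</?br\s*/?>", re.IGNORECASE)


def br_to_newline(html_text):
    """Replace simple <br>-style tags in a raw HTML string with newlines."""
    return BR_TAG.sub("\n", html_text)


sample = "a<br>b<br/>c<br />d</br>e<BR>f"
print(br_to_newline(sample))
```

A regex only rewrites the string; the other tags are still present afterwards, so this is mostly useful as a preprocessing step before parsing.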

score 5 · Answer 5 · answered Nov 28 '18 at 08:36

Adding to Ian's and dividebyzero's post/comments, you can do this to efficiently filter/replace many tags in one go:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")

edited Dec 22 '22 at 08:32

answered Nov 28 '18 at 08:36

petezurich

score 1 · Answer 6 · answered Feb 19 '20 at 19:13

Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.

To steal the list from @petezurich:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
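
A minimal sketch of how this could be used end to end (the sample HTML is invented for illustration and the tag list is shortened). Since nothing is replaced, the elements stay in the tree and only gain a trailing newline:

```python
from bs4 import BeautifulSoup

# Invented sample with nested block elements and one <br>.
html = "<div><h3>Title</h3><p>first<br>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Append a newline inside each tag of interest; the tags themselves remain.
for elem in soup.find_all(["p", "div", "h3", "br"]):
    elem.append("\n")

text = soup.get_text()
print(text)

# Nested block tags can stack up newlines, so drop empty lines if needed.
lines = [line for line in text.splitlines() if line]
```

The filtering step at the end is optional; it is there because a `<p>` inside a `<div>` contributes two consecutive newlines.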

answered Feb 19 '20 at 19:13

Zusukar

score 1 · Answer 7 · answered Jun 19 '20 at 04:31

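If you call element.text, you get the text without the br tags. You may need to define your own helper function for this purpose:

from bs4 import BeautifulSoup

def clean_text(elem):
    """Collect the text under elem, adding a newline for each <br> or <p> tag."""
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            # text nodes (NavigableString) are str subclasses
            text += e.strip()
        elif e.name in ('br', 'p'):
            text += '\n'
    return text

# get page content (request_response is assumed to be an HTTP response
# fetched earlier, e.g. with the requests library)
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))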
