50

I'm trying to extract some text using BeautifulSoup, and I'm using the get_text() function for this purpose.

My problem is that the text contains </br> tags, and I need to convert them to newlines. How can I do this?

MBZ
  • I know this is 10 years old, but did you mean `<br>` and not `</br>`? I'd simply edit it, but since there is a possibility that you actually meant broken `</br>` tags (and were misunderstood by the answerers), I thought I'd ask.
    – CherryDT Dec 22 '22 at 08:42

7 Answers

88

You can do this using the BeautifulSoup object itself, or any element of it:

for br in soup.find_all("br"):
    br.replace_with("\n")
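
For context, here is a minimal end-to-end sketch of the loop above. The sample HTML and the variable names `html` and `text` are made up for illustration, and it assumes the built-in `html.parser` backend; other parsers may build a slightly different tree.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, chosen only to illustrate the technique
html = "<td>some text<br>some more text</td>"
soup = BeautifulSoup(html, "html.parser")

# Swap every <br> element for a plain newline string
for br in soup.find_all("br"):
    br.replace_with("\n")

# get_text() now returns the text with newlines where the <br> tags were
text = soup.get_text()
print(text)
```

Because the tree itself is modified, any later call to `get_text()` on `soup` or on one of its elements sees the newlines.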
Ian Mackinnon
  • 3
    The benefit of this answer is that you can call `soup.text` afterwards to remove other HTML tags, whereas the currently accepted answer doesn't provide that possibility. – the Apr 11 '17 at 15:21
  • 9
    Watch out for this: you may end up losing some content unintentionally. You may need to do something like `br.replace_with("\n" + br.text)`. This tag is evil... – dividebyzero Feb 21 '18 at 09:52
65

As the official documentation says:

You can specify a string to be used to join the bits of text together: soup.get_text("\n")
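
As a rough illustration of what "bits of text" means here, the sketch below lists the individual text nodes and then joins them with a newline. The sample HTML and the names `pieces` and `text` are my own, and it assumes `html.parser`. Note that the separator goes between every pair of text nodes, so inline tags such as `<b>` also produce line breaks, not just `<br>`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with one <br> and one inline <b> tag
html = "<p>first line<br>second <b>bold</b> word</p>"
soup = BeautifulSoup(html, "html.parser")

# Each text node between tags is one "bit of text"
pieces = list(soup.strings)

# The separator is inserted between every pair of adjacent text nodes
text = soup.get_text("\n")
print(pieces)
print(repr(text))
```

If only `<br>` should become a newline, replacing the `<br>` elements in the tree before calling `get_text()` without a separator may be the safer choice.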

Guts
  • 1
    Seems that "bits of text" are words, not lines, so this would add a newline between each 2 words. – Sasha Apr 02 '18 at 11:56
  • 2
    @Sasha I'm not sure what you mean by that -- I believe "bits of text" refers to text separated by tags. I am certainly not getting a newline between each pair of words when I run it, as you suggest. – Y Davis Jul 16 '19 at 17:54
7

You can also use get_text(separator='\n', strip=True):

from bs4 import BeautifulSoup

bs = BeautifulSoup('<td>some text<br>some more text</td>', 'html.parser')
text = bs.get_text(separator='\n', strip=True)
print(text)

Output:
some text
some more text

It works for me.

Mohammad Anvari
5

A regex should do the trick.

import re
# Raw string avoids an invalid-escape warning; /? also matches <br/> and <br />
s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
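
If the markup mixes several spellings of the tag, a slightly broader pattern can be applied to the raw HTML before parsing. This is only a sketch: the pattern, the name `BR_PATTERN`, and the sample string are my own assumptions, and it does not handle `<br>` tags that carry attributes.

```python
import re
from bs4 import BeautifulSoup

# Matches <br>, <br/>, <br /> and the malformed </br>, in any letter case
BR_PATTERN = re.compile(r"</?br\s*/?>", re.IGNORECASE)

# Hypothetical sample markup containing each variant once
raw_html = "<p>one<br>two<br/>three<BR />four</br>five</p>"

# Replace the tags in the raw string, then let BeautifulSoup strip the rest
with_newlines = BR_PATTERN.sub("\n", raw_html)
text = BeautifulSoup(with_newlines, "html.parser").get_text()
print(text)
```

Running the regex on the raw string sidesteps differences in how parsers treat a stray `</br>`, at the cost of the usual fragility of matching HTML with regular expressions.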

Hope this helps!

mbinette
5

Adding to Ian's answer and dividebyzero's comment, you can do this to filter and replace many tags in one go:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
petezurich
1

Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.

To steal the list from @petezurich:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
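
A small sketch of how this behaves on nested markup, assuming `html.parser` and a made-up sample string. The tag list is shortened to `p` and `br` for brevity, and the names `html` and `text` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two paragraphs inside a div, one with a <br>
html = "<div><p>one<br>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Append a newline as the last child of each matching tag;
# the tags and their existing children stay in the tree
for elem in soup.find_all(["p", "br"]):
    elem.append("\n")

text = soup.get_text()
print(repr(text))
```

Since nothing is removed from the tree, nested elements keep their content, and each matching tag simply contributes one extra newline.
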
Zusukar
1

If you call element.text, you'll get the text without the br tags. You may need to define your own custom function for this purpose:

from bs4 import BeautifulSoup

def clean_text(elem):
    # Walk every descendant, keeping text and turning <br> and <p> into newlines
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name in ('br', 'p'):
            text += '\n'
    return text

# get page content (request_response is assumed to be an already-fetched HTTP response)
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))
Piero