I'm trying to extract some text using BeautifulSoup. I'm using the get_text() function for this purpose.
My problem is that the text contains </br> tags and I need to convert them to newlines. How can I do this?
You can do this using the BeautifulSoup object itself, or any element of it:
for br in soup.find_all("br"):
    br.replace_with("\n")
As the official documentation says:
You can specify a string to be used to join the bits of text together: soup.get_text("\n")
You can also use get_text(separator='\n', strip=True):
from bs4 import BeautifulSoup
bs = BeautifulSoup('<td>some text<br>some more text</td>', 'html.parser')
text = bs.get_text(separator='\n', strip=True)
print(text)
Output:
some text
some more text
It works for me.
A regex should do the trick.
import re
# Raw string avoids the invalid-escape warning; /? also matches <br/> and <br />
s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
Hope this helps!
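The regex approach works on the raw markup string, before any parsing. A minimal sketch, assuming a pattern that accepts self-closing and upper-case forms; BR_PATTERN, br_to_newline and the sample string are hypothetical names made up for this example.

```python
import re

# Hypothetical pattern: matches <br>, <br/>, <br /> in any letter case
BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)

def br_to_newline(html: str) -> str:
    """Replace every <br> variant in a raw HTML string with a newline."""
    return BR_PATTERN.sub("\n", html)

sample = "line one<br>line two<BR/>line three<br />line four"
result = br_to_newline(sample)
print(result)
```

This only rewrites the br tags; any other markup in the string is left untouched, so you would still pass the result to BeautifulSoup if you need the remaining tags stripped.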
Adding to Ian's and dividebyzero's posts/comments, you can do this to efficiently filter/replace many tags in one go:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.
To steal the list from @petezurich:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
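To see what appending does, here is a small sketch on made-up markup (the html string and the shortened tag list are assumptions for the example). Because the tags stay in the tree, nested elements keep their structure and get_text() simply picks up the extra newlines.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two paragraphs, the second containing a <br>
html = "<div><p>first paragraph</p><p>second<br>paragraph</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Add a newline as the last child of each matching tag
for elem in soup.find_all(["p", "br"]):
    elem.append("\n")

text = soup.get_text()
lines = text.splitlines()
print(lines)
```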
If you call element.text, you'll get the text without the br tags.
You may need to define your own custom function for this purpose:
from bs4 import BeautifulSoup

def clean_text(elem):
    """Return the element's text, emitting a newline for each <br> and <p>."""
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name == 'br' or e.name == 'p':
            text += '\n'
    return text

# get page content (request_response is assumed to be an HTTP response, e.g. from requests.get)
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))
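To try the function without a live request, here is a sketch that runs it on an inline string. The sample markup is made up, and clean_text is restated (using a list and join instead of string concatenation) so the block runs on its own.

```python
from bs4 import BeautifulSoup

def clean_text(elem):
    """Collect stripped text nodes, emitting a newline at each <br> or <p>."""
    parts = []
    for e in elem.descendants:
        if isinstance(e, str):
            # NavigableString subclasses str, so text nodes land here
            parts.append(e.strip())
        elif e.name in ("br", "p"):
            parts.append("\n")
    return "".join(parts)

# Hypothetical markup standing in for the downloaded page
html = '<div class="description-class">Intro<br>line two<p>a paragraph</p></div>'
soup = BeautifulSoup(html, "html.parser")
description_div = soup.select_one(".description-class")

result = clean_text(description_div)
print(result)
```

Because every text node is stripped, whitespace between inline elements is lost as well, which may or may not be what you want.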
Do you mean `<br>` and not `</br>`? I'd simply edit it, but since there is a possibility that you actually mean broken `</br>` tags (and were misunderstood by the answerers), I thought I'd ask. – CherryDT Dec 22 '22 at 08:42