Extract Text from <p> element over <br> elements

Question

I'm writing a script using BeautifulSoup to extract text from <p> elements; it works well until I encounter a <p> element that contains <br> tags, in which case it only captures the text BEFORE the first <br> tag. How can I edit my code to capture all of the text?

My code:

coms = soup.select('li > div[class=comments]')[0].select('p')
inp = [i.find(text=True).lstrip().rstrip() for i in coms]

The problem HTML (note the <br> tags):

<p>             
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>

What my code currently outputs:

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.'

What my code SHOULD output (note the extra text):

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen. ITR info: Rachel Hoffman, CD Chris Kory, acc. Monitor is Iftiaz Haroon.'

(Note: Forgive my sometimes-questionable terminology; I'm largely self-taught.)

Possible duplicate of [BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are](https://stackoverflow.com/questions/2957013/beautifulsoup-just-get-inside-of-a-tag-no-matter-how-many-enclosing-tags-there) — wwii, Mar 01 '18 at 19:10
Possible duplicate of [How do I use BeautifulSoup4 to get ALL text before <br> tag](https://stackoverflow.com/questions/48722571/how-do-i-use-beautifulsoup4-to-get-all-text-before-br-tag) — TheoretiCAL, Mar 01 '18 at 19:11

score 0 · Answer 1 · answered Mar 01 '18 at 19:21

I'm afraid that this question might be ill-framed. I copied the HTML into a file then ran the following code:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('matthew.htm').read(), 'lxml')
>>> soup.find('p').text
'             \n                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.\n\nITR info:\n\nRachel Hoffman, CD\nChris Kory, acc.\n\nMonitor is Iftiaz Haroon.                '

So .text already returns all of the text, including the parts after the <br> tags; all that remains is to collapse the leftover whitespace and newlines.
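One way to do that last cleanup step, as a minimal sketch: split the string on any run of whitespace and rejoin it with single spaces. The HTML here is a shortened stand-in for the question's markup, and 'html.parser' is used only to avoid depending on lxml; neither choice comes from the answer itself.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (assumption, not the full sample)
html = """<p>
        Alts called now through 53.<br>
<br>
ITR info:<br>
Monitor is Iftiaz Haroon.        </p>"""

soup = BeautifulSoup(html, "html.parser")

# .text concatenates every text node under <p>, newlines and padding included
raw = soup.find("p").text

# str.split() with no arguments splits on any whitespace run and drops empties,
# so joining with a single space normalizes the whole string
cleaned = " ".join(raw.split())
print(cleaned)
# Alts called now through 53. ITR info: Monitor is Iftiaz Haroon.
```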

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

You can use get_text(strip=True).

From the documentation:

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text using strip=True.

from bs4 import BeautifulSoup

html = '''<p>
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find('p').get_text(strip=True))

Output:

Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.ITR info:Rachel Hoffman, CDChris Kory, acc.Monitor is Iftiaz Haroon.
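Note that the output above runs the pieces together ("seen.ITR info:"), whereas the question asks for spaces between them. get_text() also accepts a separator argument that is placed between the stripped pieces. The sketch below applies it to the question's list comprehension; the HTML is a made-up miniature of the page structure implied by the question's selector, and 'html.parser' is an assumption chosen to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two comments, one containing <br> tags
html = """<li><div class="comments">
<p>  First line.<br>
<br>
Second line.<br>
Third line.   </p>
<p>No breaks here.</p>
</div></li>"""

soup = BeautifulSoup(html, "html.parser")

# Same selection as in the question
coms = soup.select("li > div[class=comments]")[0].select("p")

# strip=True trims each text node and skips whitespace-only ones;
# separator=" " joins the remaining pieces with a single space
inp = [p.get_text(separator=" ", strip=True) for p in coms]
print(inp)
# ['First line. Second line. Third line.', 'No breaks here.']
```

The original find(text=True) call returns only the first text node under each <p>, which is why everything after the first <br> was lost; get_text() walks all of them.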
