I'm writing a script using BeautifulSoup to extract text from <p> elements; it works well until I encounter a <p> element that contains <br> tags, in which case it only captures the text BEFORE the first <br> tag. How can I edit my code to capture all of the text?

My code:

coms = soup.select('li > div[class=comments]')[0].select('p')
inp = [i.find(text=True).lstrip().rstrip() for i in coms]

The problem HTML (note the <br> tags):

<p>             
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>

What my code currently outputs:

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.'

What my code SHOULD output (note the extra text):

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen. ITR info: Rachel Hoffman, CD Chris Kory, acc. Monitor is Iftiaz Haroon.'

(Note: Forgive my sometimes-questionable terminology; I'm largely self-taught.)

Keyur Potdar
Matt Billman
  • Possible duplicate of [BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are](https://stackoverflow.com/questions/2957013/beautifulsoup-just-get-inside-of-a-tag-no-matter-how-many-enclosing-tags-there) – wwii Mar 01 '18 at 19:10
  • Possible duplicate of [How do I use BeautifulSoup4 to get ALL text before <br> tag](https://stackoverflow.com/questions/48722571/how-do-i-use-beautifulsoup4-to-get-all-text-before-br-tag) – TheoretiCAL Mar 01 '18 at 19:11

2 Answers

I'm afraid that this question might be ill-framed. I copied the HTML into a file then ran the following code:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('matthew.htm').read(), 'lxml')
>>> soup.find('p').text
'             \n                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.\n\nITR info:\n\nRachel Hoffman, CD\nChris Kory, acc.\n\nMonitor is Iftiaz Haroon.                '

The complete text is there, including everything after the <br> tags; all that remains is to normalize the whitespace.
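As a minimal sketch of that last step (not part of the original answer): splitting the string on any whitespace and re-joining it with single spaces collapses the newlines and padding. The variable names `raw` and `cleaned` are illustrative, and `html.parser` is used here only so the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# The problem HTML from the question, inlined instead of read from a file.
html = """<p>
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>"""

soup = BeautifulSoup(html, "html.parser")

# .text concatenates every text node under <p>, newlines and padding included.
raw = soup.find("p").text

# str.split() with no arguments splits on any run of whitespace,
# so re-joining with a single space normalizes the whole string.
cleaned = " ".join(raw.split())
print(cleaned)
```

This should print the single-line string the question asks for, with one space wherever the <br> tags and line breaks were.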

Bill Bell

You can use get_text(strip=True).

From the documentation:

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text using strip=True.

html = '''<p>             
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>'''

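from bs4 import BeautifulSoup  # import needed to run this snippet on its own

soup = BeautifulSoup(html, 'lxml')
# get_text() gathers every text node under <p>; strip=True trims each piece
print(soup.find('p').get_text(strip=True))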

Output:

Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.ITR info:Rachel Hoffman, CDChris Kory, acc.Monitor is Iftiaz Haroon.
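
Note that the output above runs the pieces together ("seen.ITR info:Rachel Hoffman, CDChris Kory"), because `strip=True` trims each text node and the default separator is an empty string. A small sketch of one way to get spaces back, using the `separator` argument of `get_text()`; the sample HTML, the `div.comments p` selector, and the variable names are made up for illustration and are not the asker's real page:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup shaped like the comments block in the question.
html = (
    "<div class='comments'>"
    "<p> First line.<br><br>Second line.<br>Third line. </p>"
    "<p>No breaks here.</p>"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.select("div.comments p")

# strip=True alone joins the stripped pieces with no separator.
joined_tight = paragraphs[0].get_text(strip=True)

# Passing separator=" " puts one space between the stripped pieces.
texts = [p.get_text(separator=" ", strip=True) for p in paragraphs]

print(joined_tight)  # First line.Second line.Third line.
print(texts)         # ['First line. Second line. Third line.', 'No breaks here.']
```

Applied to the question's code, the same call would replace `i.find(text=True).lstrip().rstrip()` inside the list comprehension, since `find(text=True)` returns only the first text node of each <p>.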
Keyur Potdar