How to get all text between just two specified tags using BeautifulSoup?

Question

html = """
...
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
...
"""

I want to get all the text from the first big tag up to, but not including, the first a tag that follows it. For this example, I should get (iterable) as a string.

Jon Clements · Answer 1 · 2012-08-04T15:16:51.010

An iterative approach.

from BeautifulSoup import BeautifulSoup as bs
from itertools import takewhile, chain

def get_text(html, from_tag, until_tag):
    soup = bs(html)
    for big in soup(from_tag):
        until = big.findNext(until_tag)
        strings = (node for node in big.nextSiblingGenerator() if getattr(node, 'text', '').strip())
        selected = takewhile(lambda node: node != until, strings)
        try:
            yield ''.join(getattr(node, 'text', '') for node in chain([big, next(selected)], selected))
        except StopIteration as e:
            pass

for text in get_text(html, 'big', 'a'):
    print text
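For anyone on Python 3, here is a rough port of the same sibling-based idea to bs4. It is a sketch, not a drop-in replacement: the `html.parser` backend and the `isinstance(..., Tag)` filter are my assumptions, and like the original it only works when the closing tag is a sibling of the starting tag.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag

html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
"""

def get_text(html, from_tag, until_tag):
    # Parser choice is an assumption; any bs4 parser should behave the same here.
    soup = BeautifulSoup(html, "html.parser")
    for start in soup.find_all(from_tag):
        until = start.find_next(until_tag)
        # Consider only sibling tags that actually contain some text.
        tags = (node for node in start.next_siblings
                if isinstance(node, Tag) and node.get_text().strip())
        # Stop as soon as the walk reaches the first `until_tag`.
        selected = takewhile(lambda node: node is not until, tags)
        parts = [start.get_text()] + [node.get_text() for node in selected]
        # Skip a start tag that has nothing between it and `until_tag`.
        if len(parts) > 1:
            yield "".join(parts)

for text in get_text(html, "big", "a"):
    print(text)
```

With the question's sample this prints (iterable) once; the second big tag is skipped because nothing sits between it and the a tag.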

score 4 · Accepted Answer · edited Aug 04 '12 at 14:49

I would avoid nextSibling: from your question, you want to include everything up until the next <a>, regardless of whether that is in a sibling, parent or child element.

Therefore I think the best approach is to find the node that is the next <a> element and loop recursively until then, adding each string as encountered. You may need to tidy up the below if your HTML is vastly different from the sample, but something like this should work:

from bs4 import BeautifulSoup, NavigableString

# `html` is the string from the question.
soup = BeautifulSoup(html, 'html.parser')
firstBigTag = soup.find('big')
nextATag = firstBigTag.find_next('a')

def loopUntilA(text, element):
    # Stop at the target <a>, or if the document ends first.
    if element is None or element is nextATag:
        return text
    # Only string nodes contribute text; strip() drops whitespace between tags.
    if isinstance(element, NavigableString):
        text += element.strip()
    # next_element walks the tree in document order, into children and back out.
    return loopUntilA(text, element.next_element)

targetString = loopUntilA('', firstBigTag)
print(targetString)

Yes, exactly. I want to include everything up to the next 'a' tag, and there may be any number of tags and text nodes between the first 'big' tag and the first 'a' tag. – Amit Yadav, Aug 04 '12 at 14:37
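To illustrate the case described in that comment, here is a minimal sketch that walks the document in order with bs4's `next_elements`, so it does not matter how deeply the intermediate tags are nested. The function name `text_between`, the `nested` sample and the `html.parser` backend are all made up for this example; it assumes bs4 on Python 3.

```python
from bs4 import BeautifulSoup, NavigableString

def text_between(html, from_tag, until_tag):
    # Hypothetical helper: text from the first `from_tag` up to the next `until_tag`.
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(from_tag)
    if start is None:
        return ""
    until = start.find_next(until_tag)
    parts = []
    # next_elements yields every following node in document order,
    # descending into children and climbing back out of parents.
    for node in start.next_elements:
        if node is until:
            break
        if isinstance(node, NavigableString):
            parts.append(node.strip())
    return "".join(parts)

# Illustrative markup where the pieces live in different parents.
nested = (
    '<p><big>(</big><span><em>iterable</em></span></p>'
    '<div><big>)</big><a href="#all">link</a></div>'
)
print(text_between(nested, "big", "a"))
```

A loop is used instead of recursion so that a long document cannot hit Python's recursion limit. Note that comments are also string nodes in bs4, so they would be picked up unless filtered out.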

score 1 · Answer 3 · answered Aug 04 '12 at 13:47

You can do it like this:

from BeautifulSoup import BeautifulSoup
html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="test" title="Permalink to this definition"></a>
"""
soup = BeautifulSoup(html)
print soup.find('big').nextSibling.next.text

For details, see the BeautifulSoup documentation on traversing the DOM.

mushfiq

This returns "iterable" rather than "(iterable)" – anotherdave Aug 04 '12 at 14:02
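To see why the chain lands on the em tag only, it can help to print each step. This is a small sketch using the bs4 equivalents (`next_sibling`, `next_element`) on Python 3, with `html.parser` as an assumed backend; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="test" title="Permalink to this definition"></a>
"""

soup = BeautifulSoup(html, "html.parser")
first_big = soup.find("big")

# Step 1: the sibling right after </big> is the newline text node, not <em>.
whitespace = first_big.next_sibling
# Step 2: the next element after that newline is the <em> tag itself.
em = whitespace.next_element

print(repr(whitespace), em.name, em.get_text())
```

The expression never reads the text of either big tag, which is why the parentheses are missing from the result.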

score 0 · Answer 4 · answered Aug 04 '12 at 13:11

>>> from BeautifulSoup import BeautifulSoup as bs
>>> parsed = bs(html)
>>> txt = []
>>> for i in parsed.findAll('big'):
...     txt.append(i.text)
...     if i.nextSibling.name != u'a':
...         txt.append(i.nextSibling.text)
...
>>> ''.join(txt)
u'(iterable)'

Burhan Khalid

`nextSibling` cannot be used, as I want to include all text up to the first occurrence of the 'a' tag. – Amit Yadav Aug 04 '12 at 14:39
