How can I select the content that immediately follows each div.title and is not itself enclosed in a tag, using BeautifulSoup?

In the example below, I need to retrieve:

[Text I care about which <b>can</b> have formatting..., Text I care about., Text I care about <span class='someclass'>which can be in a span</span>...]

Example

<div class="level1">
    <div class="title">
        Title I do not care about
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        Text I care about which <b>can</b> have formatting...
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about. 
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about <span class='someclass'>which can be in a span</span>...
        </div>
    </div>
</div>

Please note that I will need to modify the text at specific positions using a regex. Therefore, I need the entire text, including the formatting tags (b, br, span, etc.).

nbeuchat

2 Answers

You can use the bs4 extract() method to remove the unwanted elements from each item returned by find_all.

For example:

import bs4

soup = bs4.BeautifulSoup(texthere, "html.parser")  # texthere: your HTML string
divs = soup.find_all("div", {"class": "level3"})  # find all level3 divs
for div in divs:
    title = div.find("div", {"class": "title"})  # find the title within each div
    if title is not None:
        title.extract()  # remove that title from the div
    print(div.text)  # printed here, but you can repurpose this for whatever you need

Here is a good source from SO: Exclude unwanted tag on Beautifulsoup Python

Hope it helps!

cosinepenguin
  • Thanks, but unfortunately it does not behave as expected. First, the class selector `level3` prevents finding the first text I am looking for. If I try this on `level2`, `div.text` returns the text of the whole `div.level2`. Also, the text is stripped of the formatting elements `span` and `br`. I ended up modifying the document (which came from another process) to add a `span` containing all the content I am looking for, but that only works because I had access to this generator. – nbeuchat Nov 29 '17 at 17:42
from bs4 import BeautifulSoup

strn = """
<div class="level1">
    <div class="title">
        Title I do not care about
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        Text I care about which <b>can</b> have formatting...
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about. 
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about <span class='someclass'>which can be in a span</span>...
        </div>
    </div>
</div> """

soup = BeautifulSoup(strn, 'html.parser')

the_divs = soup.find_all('div', class_='title')
for the_div in the_divs:
    # Walk every node that shares a parent with the title
    for the_sibling in the_div.parent.contents:
        # Skip the title and nested divs; keep non-empty text and inline tags
        if the_sibling.name != 'div' and str(the_sibling).strip():
            print(str(the_sibling))

Play with the 'the_sibling' variable here to build the string you need, e.g. str(the_sibling) returns the text together with the tags it is wrapped in (your <b>s or <span>s).
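
If you want one string per title rather than one print per node, something like the following might work. It is only a sketch: the function name `content_after_titles` and the shortened `sample_html` are made up for illustration, and it assumes the text you care about always sits next to the title inside the same parent, with nested levels being `div` elements. It walks `next_siblings` of each `div.title`, skips nested `div`s, and joins the rest with `str()` so inline tags such as `b` and `span` are kept.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened version of the example document (hypothetical sample data)
sample_html = """
<div class="level1">
    <div class="title">Title I do not care about</div>
    <div class="level2">
        <div class="title">Title I do not care about</div>
        Text I care about which <b>can</b> have formatting...
    </div>
    <div class="level2">
        <div class="title">Title I do not care about</div>
        <div class="level3">
            <div class="title">Title I do not care about</div>
            Text I care about.
        </div>
        <div class="level3">
            <div class="title">Title I do not care about</div>
            Text I care about <span class='someclass'>which can be in a span</span>...
        </div>
    </div>
</div>
"""


def content_after_titles(markup):
    """Return, for each div.title, the markup that follows it in its parent.

    Nested divs are skipped, and titles followed only by whitespace or
    nested divs produce no entry.
    """
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for title in soup.find_all("div", class_="title"):
        parts = []
        for sibling in title.next_siblings:
            # Nested levels are divs; leave them out of this title's content
            if isinstance(sibling, Tag) and sibling.name == "div":
                continue
            # str() keeps inline tags (b, span, br, ...) as markup
            parts.append(str(sibling))
        content = "".join(parts).strip()
        if content:
            results.append(content)
    return results


for item in content_after_titles(sample_html):
    print(item)
```

Each returned item is a plain string, so it can be passed to `re.sub` directly. Note that BeautifulSoup re-serializes tags, so attribute quoting may differ from the input (for instance, single quotes around `someclass` come back as double quotes).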