Python Webscraping beautifulsoup avoid repetition in find_all()

Question

I am working on web scraping in Python using BeautifulSoup. I am trying to extract text that is bold, italic, or both. Consider the following HTML snippet:

<div>
  <b> 
    <i>
      HelloWorld
   </i>
  </b>
</div>

If I use the command sp.find_all(['i', 'b']), I understandably get two results, one for the bold tag and one for the italic tag, i.e.

['HelloWorld', 'HelloWorld']

Is there a way to extract each piece of text only once and also get the tags that wrap it? My desired output is something like:

tag : text - HelloWorld, tagnames : [b,i]

Please note that comparing the text and weeding out non-unique occurrences of the text is not a feasible option, since I might have 'HelloWorld' repeated many times in the text, which I would want to extract.

Thanks!

Tomalak · Accepted Answer · 2020-04-30T08:44:28.163

0

The most natural way of finding nodes that have <i> or <b> (or both) among their ancestors would be XPath:

//node()[ancestor::i or ancestor::b]

Instead of node() you could use text() to find text nodes, or * to find elements, depending on the situation. This would not select any duplicates and it does not care in what order <i> and <b> are nested.

The issue with this idea is that BeautifulSoup does not support XPath. For this reason, I would use lxml instead of BeautifulSoup for web scraping.
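
As a rough illustration of the lxml route, here is a minimal sketch that uses the text() variant of the expression above and then walks up from each text node to collect the wrapping tag names. The sample document, the helper name styled_text, and the TAGS tuple are assumptions made up for this example; they are not part of the original answer.

```python
from lxml import html

# Hypothetical sample: the question's snippet plus a second, italic-only
# "HelloWorld" to show that repeated text is kept.
snippet = """
<div>
  <b>
    <i>
      HelloWorld
    </i>
  </b>
  <p>plain <i>HelloWorld</i> again</p>
</div>
"""

TAGS = ("b", "i")


def styled_text(tree, tags=TAGS):
    """Return (text, [wrapping tag names]) for each text node inside any of `tags`."""
    condition = " or ".join(f"ancestor::{tag}" for tag in tags)
    results = []
    for node in tree.xpath(f"//text()[{condition}]"):
        text = node.strip()
        if not text:
            continue  # skip whitespace-only text nodes between tags
        # lxml reports tail text against the preceding sibling, so step up
        # once more to reach the element that actually contains the text.
        parent = node.getparent()
        if node.is_tail:
            parent = parent.getparent()
        chain = [parent, *parent.iterancestors()]
        # Outermost tag first, keeping only the tags of interest.
        names = [el.tag for el in reversed(chain) if el.tag in tags]
        results.append((text, names))
    return results


tree = html.fromstring(snippet)
for text, names in styled_text(tree):
    print(f"text: {text}, tagnames: {names}")
```

For this sample the loop should report HelloWorld twice, once wrapped in b and i and once wrapped in i only, so repeated text is not collapsed. Swapping text() for * in the expression would return the matching elements instead of text nodes.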

edited Apr 30 '20 at 08:44

answered Apr 28 '20 at 08:46

Tomalak


Hey, thanks for the solution! I am new to lxml, so I'm not clear on most of the terminology. For this problem, what would be the difference between node() and text()? Also, shouldn't it be 'or' instead of 'and'? Thanks! – OlorinIstari Apr 29 '20 at 09:07
@Shrutheesh Ah, I thought you only wanted nodes that are descendants of *both* `<i>` and `<b>`. Yeah, if that's not the case it should be `or`. *"What's the difference between `node()` and `text()`?"* has been asked on this site before, take a look around. An XPath crash course is a little bit out of scope for this thread. :) My goal was to show a second option of doing this, but by all means stick to BeautifulSoup if you're more comfortable with it. – Tomalak Apr 29 '20 at 09:19
the solution works great! I'll explore xpath a bit more and find my way around. Thanks! – OlorinIstari Apr 29 '20 at 09:50
@Shrutheesh Awesome! Good luck! – Tomalak Apr 29 '20 at 10:22
Hey @Tomalak, there's another tiny query that I had. I realised some of my documents bold the text using 'font-weight : bold' as well. How do I combine this constraint into my previous xpath solution as well? Thanks – OlorinIstari Apr 30 '20 at 04:53
@Shrutheesh If it is defined as an inline style (a `style` attribute on the element itself) you can try XPath: `... or ancestor-or-self::*[contains(@style, 'font-weight: bold')]`. But since CSS **a)** is extremely complex (for example, a `font-weight: bold` defined higher up in the DOM tree might be overridden by a `font-weight: normal` further down) and **b)** can also exist entirely outside of the DOM tree (in a CSS file), "selecting bold elements, no matter why they are bold" is somewhere between difficult and impossible with XPath. It depends very much on the HTML source code in question. – Tomalak Apr 30 '20 at 08:44

Hey @Tomalak, thanks. The solution you provided in the first line works perfectly for my application :) – OlorinIstari Apr 30 '20 at 09:42
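
The inline-style variant from the comments could look roughly like the sketch below. It assumes lxml and a made-up sample document. Because contains() does a literal substring match, 'font-weight : bold' (as written in the comment above) would not match 'font-weight: bold', so this sketch strips spaces with the standard XPath translate() function first. That normalization is an addition of this example, not part of the original suggestion.

```python
from lxml import html

# Hypothetical sample with one inline-styled, one tagged and one plain element.
snippet = """
<div>
  <span style="font-weight : bold">Styled</span>
  <b>Tagged</b>
  <span style="font-weight: normal">Plain</span>
</div>
"""

# translate(@style, ' ', '') removes spaces before the substring test, so
# both "font-weight: bold" and "font-weight : bold" are matched.
BOLD_OR_ITALIC = (
    "//text()["
    "ancestor::i or ancestor::b or "
    "ancestor-or-self::*[contains(translate(@style, ' ', ''), 'font-weight:bold')]"
    "]"
)

tree = html.fromstring(snippet)
# Keep only text nodes that contain something other than whitespace.
matches = [t.strip() for t in tree.xpath(BOLD_OR_ITALIC) if t.strip()]
print(matches)
```

As noted in the comment, this only sees inline styles. It cannot tell when a bold declaration is overridden further down the tree or defined in an external stylesheet, and the substring test would also match values such as 'font-weight: bolder'.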

score 0 · Answer 2 · answered Apr 28 '20 at 09:21

I would say that it is not clearly defined. What if you have <b>foo<i>bar</i></b> (it can be even more complicated)?

Anyway, I would say that you have to implement the recursion.

Here is an example:

import bs4

html = """
<div>
  <b> 
    <i>
      HelloWorld
   </i>
  </b>
</div>
"""

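def recursive_find(soup):
    # Walk the tree and print text wrapped in two nested <b>/<i> tags once.
    for child in soup.children:
        # Skip text nodes: they have no child elements to search.
        if not isinstance(child, bs4.Tag):
            continue
        result = child.find_all(['i', 'b'], recursive=False)
        if result:
            if len(result) == 1:
                # One <b>/<i> child: check whether it wraps a second one.
                result_s_result = result[0].find_all(['i', 'b'], recursive=False)
                if len(result_s_result) == 1:
                    print(result_s_result[0].contents)
            else:
                print(result)
        else:
            # No direct <b>/<i> children here, so look one level deeper.
            recursive_find(child)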

oneline_html = "".join(line.strip() for line in html.split("\n"))

soup = bs4.BeautifulSoup(oneline_html, 'html.parser')

recursive_find(soup)
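
For comparison, here is a different approach that stays within BeautifulSoup. It is a sketch, not the recursion above: it iterates over the text nodes and looks upward through each node's parents, so each piece of text is visited exactly once. The helper name styled_strings and the sample document are assumptions for illustration.

```python
import bs4

# Hypothetical sample: the question's snippet plus an italic-only repeat.
html_doc = """
<div>
  <b>
    <i>
      HelloWorld
    </i>
  </b>
  <p>plain <i>HelloWorld</i> again</p>
</div>
"""


def styled_strings(soup, tags=("b", "i")):
    """Return (text, [wrapping tag names]) for each text node inside any of `tags`."""
    results = []
    for string in soup.find_all(string=True):
        text = string.strip()
        if not text:
            continue  # ignore whitespace-only text nodes
        # .parents yields the enclosing elements from innermost to outermost.
        names = [parent.name for parent in string.parents if parent.name in tags]
        if names:
            results.append((text, names[::-1]))  # outermost tag first
    return results


soup = bs4.BeautifulSoup(html_doc, "html.parser")
for text, names in styled_strings(soup):
    print(f"text: {text}, tagnames: {names}")
```

Because the loop is driven by text nodes and not by tags, nested <b> and <i> elements cannot produce the same text twice, while genuinely repeated text elsewhere in the document is still reported.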
