Get all links in html including within conditional comments

Question

Assuming I have this simple html:

<html>
  <body>

    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->

    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->

  </body>
</html>

Is there a way to use lxml.html or BeautifulSoup to get both links? Currently I get only one. In other words, I want the parser to look into html conditional comments also (not sure what the technical term is).

lxml.html

>>> from lxml import html
>>> doc = html.fromstring(s)
>>> list(doc.iterlinks())

<<< [(<Element a at 0x10f7f7bf0>, 'href', 'http://link1.com', 0)]

BeautifulSoup

>>> from BeautifulSoup import BeautifulSoup
>>> b = BeautifulSoup(s)
>>> b.findAll('a')

<<< [<a href="http://link1.com">Link 1</a>]

Does this help answer your question: https://stackoverflow.com/questions/52679150/beautifulsoup-extract-text-from-comment-html — KunduK, May 07 '20 at 15:49

score 2 · Accepted Answer · answered May 07 '20 at 15:52

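You need to pull out the comments and then parse those as HTML. The parser treats everything inside a conditional comment such as <!--[if mso]> ... <![endif]--> as a single comment node, so the links in it never show up as tags.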

html = '''<html>
  <body>

    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->

    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->

  </body>
</html>'''



from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(html, 'html.parser')

# Links that sit in the regular markup.
links = soup.find_all('a', href=True)

# A comment node holds its contents as plain text, so parse each comment
# again (naming the parser explicitly) and collect the links inside it.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    links += BeautifulSoup(comment, 'html.parser').find_all('a', href=True)

print(links)

Output:

[<a href="http://link1.com">Link 1</a>, <a href="http://link2.com">Link 2</a>]
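If installing a third-party parser is not an option, the same idea (find the comments, then parse their contents as HTML) can probably be sketched with the standard library's html.parser instead of BeautifulSoup. The names LinkCollector and collect_hrefs below are made up for illustration, and this version returns the href strings rather than tag objects.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values of <a> tags, comments included."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; keep the href of each <a>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_comment(self, data):
        # The comment body arrives as plain text, so feed it to a fresh
        # parser and merge whatever links that parser finds.
        inner = LinkCollector()
        inner.feed(data)
        inner.close()
        self.hrefs.extend(inner.hrefs)


def collect_hrefs(markup):
    """Return all <a href> values in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


sample = '''<html>
  <body>
    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->
    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->
  </body>
</html>'''

print(collect_hrefs(sample))  # ['http://link1.com', 'http://link2.com']
```

Because handle_comment creates a new collector for each comment, links in comments nested inside other comments would be picked up as well.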

Ah yes, that's one way to do it. Why didn't I think of that? Thank you though! — Krimson, May 07 '20 at 15:57
Hey, it took me forever to learn that too. It wasn't until I was stuck on it that I learned how to do it. Now just stick this in your toolbox for next time. — chitown88, May 07 '20 at 19:53
