Assuming I have this simple html:

<html>
  <body>

    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->

    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->

  </body>
</html>

Is there a way to use lxml.html or BeautifulSoup to get both links? Currently I only get one. In other words, I want the parser to also look inside HTML conditional comments (I'm not sure what the technical term is).

lxml.html

>>> from lxml import html
>>> doc = html.fromstring(s)
>>> list(doc.iterlinks())

<<< [(<Element a at 0x10f7f7bf0>, 'href', 'http://link1.com', 0)]

BeautifulSoup

>>> from BeautifulSoup import BeautifulSoup
>>> b = BeautifulSoup(s)
>>> b.findAll('a')

<<< [<a href="http://link1.com">Link 1</a>]

Krimson
  • Does this help answer your question: https://stackoverflow.com/questions/52679150/beautifulsoup-extract-text-from-comment-html – KunduK May 07 '20 at 15:49

1 Answer

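You need to pull out the comments first, then parse each one as HTML. The parser treats everything between <!--[if mso]> and <![endif]--> as a single comment node, so the link inside it never shows up as a tag.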

html = '''<html>
  <body>

    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->

    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->

  </body>
</html>'''



from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(html, 'html.parser')

# Links that are part of the regular parse tree
links = soup.find_all('a', href=True)

# Conditional comments are stored as Comment nodes, so parse each one separately
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    links += BeautifulSoup(comment, 'html.parser').find_all('a', href=True)

print(links)

Output:

[<a href="http://link1.com">Link 1</a>, <a href="http://link2.com">Link 2</a>]
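
As a variation on the same idea, here is a minimal sketch that uses only the standard library's html.parser instead of bs4. It collects href values rather than tag objects, and it re-parses every comment body with a fresh parser. The names LinkCollector and collect_links are made up for this illustration, and the sketch assumes you only need the URLs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags,
    including <a> tags found inside HTML comments."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_comment(self, data):
        # Treat the comment body as HTML and parse it with a new parser
        inner = LinkCollector()
        inner.feed(data)
        inner.close()
        self.links.extend(inner.links)


def collect_links(markup):
    """Return every href in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


sample = '''<html>
  <body>

    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->

    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->

  </body>
</html>'''

print(collect_links(sample))
```

With the sample markup this should print both URLs, link1 first. If you need the tag objects or the link text, the bs4 version is the easier starting point.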
chitown88
  • Ah yes, that's one way to do it. Why didn't I think of that? Thank you! – Krimson May 07 '20 at 15:57
  • Hey, it took me forever to learn that too. It wasn't until I was stuck on it that I learned how to do it. Now just stick this in your toolbox for next time. – chitown88 May 07 '20 at 19:53