
Is there a simple, robust, and fast way to extract the href attributes of all anchors from an HTML document in Python?

I know there is a solution using BeautifulSoup, but the problem with BeautifulSoup is that it's too heavy, and consumes a lot of memory on some URLs.

The task is very simple: run over an HTML document and return the href of every anchor.

Does anybody know of one?

Thanks!

diemacht

1 Answer


You could use HTMLParser from the standard library. It is event-driven, so it does not build a document tree in memory.

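# Python 3 import; in Python 2 the module was named HTMLParser.
from html.parser import HTMLParser

class extract_href(HTMLParser):
    # Called once for every opening tag; attrs is a list of (name, value) pairs.
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, val in attrs:
                if key == 'href':
                    print(val)

parser = extract_href()
parser.feed("""<p><a href='www.stackoverflow.com'>link</a></p>""")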
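Since the question asks for the hrefs to be returned rather than printed, the same idea could be adapted to collect them in a list. This is only a minimal sketch assuming Python 3's html.parser; the names HrefCollector and extract_hrefs are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for key, val in attrs:
            # val is None for a bare attribute such as <a href>.
            if key == "href" and val is not None:
                self.hrefs.append(val)


def extract_hrefs(html):
    """Return all anchor hrefs found in the HTML string, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


print(extract_hrefs("<p><a href='www.stackoverflow.com'>link</a></p>"))
# prints ['www.stackoverflow.com']
```

For large pages, feed() can also be called repeatedly with successive chunks of the document instead of one big string, which may help with the memory concern raised in the question.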
Anonymous Coward