Get relative links from html page

Question

I want to extract only the relative URLs from an HTML page. Somebody suggested this:

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

but it returns:

1/ all absolute and relative URLs from the page.

2/ URLs that may be quoted with either "" or '', inconsistently.

Could you try something like this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ? — Rafael Barros, Jun 29 '14 at 03:41

score 4 · Accepted Answer · edited May 23 '17 at 12:08

Use the tool for the job: an HTML parser, like BeautifulSoup.

You can pass a function as an attribute value to find_all() and check whether href starts with http:

from bs4 import BeautifulSoup

data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""
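soup = BeautifulSoup(data, 'html.parser')
# Guard against None in case an <a> tag has no href attribute
print(soup.find_all('a', href=lambda x: x is not None and not x.startswith('http')))

Or, using urlparse and checking for the network location part:

from urllib.parse import urlparse  # Python 3; in Python 2 this lived in the urlparse module

def is_relative(url):
    # An empty netloc means the URL has no host part
    return url is not None and not urlparse(url).netloc

print(soup.find_all('a', href=is_relative))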

Both solutions print:

[<a href="test2">test2</a>, 
 <a href="here/we/go">test4</a>]
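
If installing BeautifulSoup is not an option, the same idea can be sketched with only the standard library. This is a swapped-in variant, not the approach above: it uses html.parser instead of bs4, and the class name RelativeLinkCollector and the extra sample links are made up for illustration. It also assumes that "relative" means no scheme as well as no network location, so mailto: links and protocol-relative //host/... links are both excluded.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_relative(url):
    # Assumption: "relative" means no scheme (http:, mailto:, ...)
    # and no network location (which also excludes //host/path).
    parts = urlparse(url)
    return not parts.scheme and not parts.netloc


class RelativeLinkCollector(HTMLParser):
    """Hypothetical helper: collects relative href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value and is_relative(value):
                self.links.append(value)


data = """
<div>
<a href="http://google.com">test1</a>
<a href='test2'>test2</a>
<a href="//cdn.example.com/lib.js">test3</a>
<a href="mailto:someone@example.com">test4</a>
<a href="here/we/go">test5</a>
<a>no href</a>
</div>
"""

collector = RelativeLinkCollector()
collector.feed(data)
print(collector.links)
```

The parser strips the surrounding quotes itself, so values written with "" or '' come back as plain strings, which addresses the second complaint in the question.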
