
I want to extract only the relative URLs from an HTML page. Somebody suggested this:

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

but it returns:

1. all absolute and relative URLs from the page;

2. URLs that may be quoted with "" or '', inconsistently.

alecxe
esnadr
  • Could you try something like this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ? – Rafael Barros Jun 29 '14 at 03:41

1 Answer


Use the right tool for the job: an HTML parser such as BeautifulSoup.

You can pass a function as an attribute value to find_all() and check whether href starts with http:

from bs4 import BeautifulSoup

data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""
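soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
# the function receives None for <a> tags without href, hence the `x and` guard
print(soup.find_all('a', href=lambda x: x and not x.startswith('http')))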

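Or use urlparse and check for the network location part:

from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

def is_relative(url):
    # url is None for tags without href; an empty netloc means no host was given
    return url is not None and not urlparse(url).netloc

print(soup.find_all('a', href=is_relative))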

Both solutions print:

[<a href="test2">test2</a>, 
 <a href="here/we/go">test4</a>]
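
If you need the URL strings themselves rather than the tags, the same idea can be sketched with only the standard library. This is a minimal illustration, not part of the solutions above: `RelativeLinkCollector` is a hypothetical helper name, the sample markup is made up, and the extra scheme check is an assumption on my part so that links such as `mailto:` (which have an empty network location) are not reported as relative.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_relative(url):
    """True when the URL has neither a scheme nor a network location."""
    parts = urlparse(url)
    return not parts.scheme and not parts.netloc


class RelativeLinkCollector(HTMLParser):
    """Hypothetical helper: gathers relative href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value and is_relative(value):
                self.links.append(value)


# Made-up sample: double-quoted, single-quoted and unquoted href values
data = """
<div>
<a href="http://google.com">test1</a>
<a href='test2'>test2</a>
<a href="//cdn.example.com/lib.js">test3</a>
<a href="mailto:someone@example.com">test4</a>
<a href=here/we/go>test5</a>
</div>
"""

collector = RelativeLinkCollector()
collector.feed(data)
print(collector.links)
```

For this sample it prints ['test2', 'here/we/go']. The parser strips the quotes for you, whichever style the page uses, and the protocol-relative `//cdn.example.com/lib.js` is excluded because it has a network location.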
alecxe