Use the right tool for the job: an HTML parser, like BeautifulSoup.
You can pass a function as an attribute value to find_all() and check whether href starts with http:
from bs4 import BeautifulSoup
data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""
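soup = BeautifulSoup(data, 'html.parser')
# bs4 passes None for a tag without href, so guard before calling startswith()
print(soup.find_all('a', href=lambda x: x is not None and not x.startswith('http')))
Or, using urlparse and checking for the network location part:
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
def is_relative(url):
    # no network location (host) means the URL is relative
    return url is not None and not urlparse(url).netloc
print(soup.find_all('a', href=is_relative))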
Both solutions print:
[<a href="test2">test2</a>,
<a href="here/we/go">test4</a>]
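The two checks agree on the sample markup, but they can differ on other inputs. Here is a minimal sketch, using only the standard library, that runs both predicates over a few hrefs. The function names and the extra hrefs (a protocol-relative URL and a relative file name that happens to begin with "http") are made up for illustration and do not come from the question.

```python
from urllib.parse import urlparse

def starts_without_http(href):
    # find_all() may pass None when a tag has no href, so guard against it.
    return href is not None and not href.startswith("http")

def has_no_netloc(href):
    # A URL with no network location (host) is treated as relative.
    return href is not None and not urlparse(href).netloc

# Hypothetical hrefs, including edge cases not present in the sample markup.
hrefs = ["http://google.com", "test2", "//cdn.example.com/lib.js",
         "https-notes.html", None]

by_prefix = [h for h in hrefs if starts_without_http(h)]
by_netloc = [h for h in hrefs if has_no_netloc(h)]

print(by_prefix)  # ['test2', '//cdn.example.com/lib.js']
print(by_netloc)  # ['test2', 'https-notes.html']
```

The prefix check wrongly keeps the protocol-relative URL and wrongly drops the relative file name, so the netloc check is probably the safer choice if such links can appear in your pages. Either predicate can be passed as href=... to find_all().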