I am trying to extract web links from web content with a Python regex. Here is my Python script:
webUrlList = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)
print webUrlList
The resulting webUrlList looks like this:
['/', '.html', '/', '/', '/', '/',...]
Please help me find out why this script yields the above output.
Sample target URL strings:
<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"
<a href="/abcabcdef/coffee/su1/"
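
For reference, a minimal sketch that reproduces the behavior and shows one likely explanation: when a pattern contains a capturing group, re.findall returns the text of that group rather than the whole match. The `content` string below is a hypothetical input built from the two samples above (the closing `>x</a>` parts are assumed), and the second pattern is only one possible variant, using a non-capturing group `(?:...)`, an escaped dot, and `[^"]+` in place of the greedy `.+`.

```python
import re

# Hypothetical sample input built from the two target strings above;
# the closing '>x</a>' and '>y</a>' parts are assumptions for illustration.
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html">x</a>\n'
    '<a href="/abcabcdef/coffee/su1/">y</a>\n'
)

# Original pattern: (.html|/) is a capturing group, so findall returns
# only what that group matched, not the full link.
captured = re.findall(r'(?<=<a href=").+(.html|/)(?=")', content)

# Variant with a non-capturing group, so findall returns the whole match.
# [^"]+ keeps the match inside the quoted attribute value.
whole = re.findall(r'(?<=<a href=")[^"]+(?:\.html|/)(?=")', content)

print(captured)
print(whole)
```

With this input, the first call returns only the suffixes and the second returns the full link strings. For anything beyond simple cases, an HTML parser is probably more robust than a regex.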