I am trying to extract web links from web content with a Python regex. Here's my Python script:

webUrlList = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)
print(webUrlList)

The resulting webUrlList looks like this:

['/', '.html', '/', '/', '/', '/',...] 

Please help me find out why this script yields the above output.

Sample target URL strings:

<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"

<a href="/abcabcdef/coffee/su1/" 
shanwu

2 Answers

If you're only parsing for links, and you're familiar with the content you will be parsing, the following regex should help you accomplish what you're after and is pretty safe.

import re

regex = re.compile(r'href="([^"]+)')
results = regex.findall(content)  # content holds the HTML text to search
  • href=" consumes but doesn't capture the literal characters href="
  • ([^"]+) consumes and captures any character that isn't a quotation mark
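As a quick check, here is a minimal sketch that runs this pattern over the two sample strings from the question. The variable names `content` and `links` are only illustrative.

```python
import re

# Sample strings copied from the question (the tags are left unclosed, as posted).
content = '''
<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"
<a href="/abcabcdef/coffee/su1/"
'''

regex = re.compile(r'href="([^"]+)')

# With exactly one capturing group, findall returns the text of that group
# for every match, i.e. everything between href=" and the next quotation mark.
links = regex.findall(content)
print(links)
```

Both URLs come back in full, because the capturing group spans the whole attribute value.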

Run a few trials with the content you are scraping and assess whether you need more specificity in the regex or not.
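One such trial, sketched under the assumption that only links ending in .html or / are wanted, as in the question's pattern. It also shows where the question's output comes from: when a pattern contains a capturing group, re.findall returns the group's text rather than the whole match. The names `original` and `narrowed` are made up for this example.

```python
import re

# Sample strings copied from the question.
content = '''
<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"
<a href="/abcabcdef/coffee/su1/"
'''

# The question's pattern: (.html|/) is a capturing group, so findall
# returns only what that group matched, not the full URL.
original = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)

# Same idea with a non-capturing group (?:...), an escaped dot, and [^"]+?
# so the match cannot run past the closing quotation mark.
narrowed = re.findall(r'(?<=<a href=")[^"]+?(?:\.html|/)(?=")', content)

print(original)
print(narrowed)
```

Here `original` holds just the suffixes, while `narrowed` holds the full link text because the pattern no longer has a capturing group.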

wpcarro

Use an HTML parser like BeautifulSoup:

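from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
print([a["href"] for a in soup.find_all("a", href=True)])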

Don't use a regex to parse HTML.
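If adding a third-party module is not an option, a similar result can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the BeautifulSoup approach itself, and the class name `LinkCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Illustrative input based on the sample URLs in the question.
content = '''
<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html">flower</a>
<a href="/abcabcdef/coffee/su1/">coffee</a>
<a name="no-link">anchor without href</a>
'''

collector = LinkCollector()
collector.feed(content)
print(collector.links)
```

The anchor without an href is skipped, which mirrors the href=True filter in the BeautifulSoup call.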

Community
Padraic Cunningham
  • This requires adding an additional module, BeautifulSoup, to the project. I understand that there may be better tools for parsing HTML than regular expressions, but this question asks about extracting web links using regular expressions. So while your answer works and is elegant, it seems to side-step what's being asked. – wpcarro Jul 03 '16 at 17:46
  • @wcarroll, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 You should not use a regex to parse HTML. There is no side-stepping of what is being asked; it is the correct approach to what is essentially being asked. – Padraic Cunningham Jul 03 '16 at 17:48
  • I almost included in my comment "yes I have seen the infamous SO post". I guess I should have been explicit. This doesn't change my comment above. If he is only parsing small strings that contain HTML, regular expressions are fit for the task and I think preferable to including a third-party module and learning its API. – wpcarro Jul 03 '16 at 17:50
  • @wcarroll, where does it say they are *parsing small strings that contain HTML*, *I am trying to extra web links from web content* seems pretty clear that they are parsing the full content returned. I am not going to encourage anyone to parse html with a regex and anyone that does is leading someone down a bad path – Padraic Cunningham Jul 03 '16 at 17:52
  • But he is not parsing the whole HTML to build a DOM either. Isn't it OK to look for HTTP URIs in it? If not, why exactly? – M. Timtow Jul 03 '16 at 17:58
  • @M.Timtow, the correct way to parse html is with a html parser just like the correct way to parse xml is with an xml parser, if you are a bit unclear as to why then read the linked answer. – Padraic Cunningham Jul 03 '16 at 18:06
  • @PadraicCunningham Yes, I think I found an explanation here: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – M. Timtow Jul 03 '16 at 18:07
  • But still, to my knowledge, you cannot nest HTML tags indefinitely inside an HTTP URI after href= – M. Timtow Jul 03 '16 at 18:09
  • @M.Timtow, what if the OP wants particular links from a certain part of the html that are nested inside a whole lot more html, can you imagine how that regex would look and how brittle it would be? The bottom line is don't parse html with a regex unless you want broken code. A lot of people only use a regex because they don't know of any alternative methods. – Padraic Cunningham Jul 03 '16 at 18:11
  • @PadraicCunningham In this case, you are right, it seems impossible: http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg (the phone number example) – M. Timtow Jul 03 '16 at 18:16
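
To make the brittleness discussed in these comments concrete, here is a small sketch that runs the href="([^"]+) pattern from the first answer over some made-up markup. The `tricky` input is an assumption for illustration, not content from the question.

```python
import re

regex = re.compile(r'href="([^"]+)')

# Hypothetical markup with cases the simple pattern cannot tell apart.
tricky = '''
<link href="/static/site.css" rel="stylesheet">
<!-- <a href="/old/page.html">retired</a> -->
<a href='/single/quoted/'>single quotes</a>
<a class="nav" href="/real/link/">real</a>
'''

found = regex.findall(tricky)
print(found)
```

The pattern picks up the stylesheet from the link tag and the commented-out anchor, and it misses the single-quoted href. Whether that matters depends on how well the input is known in advance, which is the point both sides of the thread are arguing about.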