Match web URL with Python regex

Question

I am trying to extract web links from web content with a Python regex. Here's my Python script:

webUrlList = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)
print webUrlList

The resulting webUrlList looks like this:

['/', '.html', '/', '/', '/', '/',...]

Please help me find out why this script yields the above output.

Sample target URL strings:

<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"

<a href="/abcabcdef/coffee/su1/"

I'm having trouble reproducing the output you're citing. When using the regex that you supplied, `r"(?<= – wpcarro Jul 03 '16 at 17:53
Just make the capturing group a non-capturing one, and use lazy dot matching. – Wiktor Stribiżew Jul 03 '16 at 18:01
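
To make the suggestion in the comment above concrete, here is a minimal sketch. It assumes `content` holds just the two sample strings from the question, and the variable names `original` and `fixed` are made up for illustration. When a pattern contains exactly one capturing group, `re.findall` returns the text of that group rather than the whole match, which would explain a list of `'/'` and `'.html'` entries.

```python
import re

# The two sample strings from the question, joined by a newline (an assumption
# about what the scraped content looks like).
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"\n'
    '<a href="/abcabcdef/coffee/su1/"'
)

# Pattern from the question: one capturing group, so findall returns only that group.
original = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)
print(original)  # ['.html', '/']

# Non-capturing group plus lazy matching: findall now returns the whole match.
fixed = re.findall(r'(?<=<a href=").+?(?:\.html|/)(?=")', content)
print(fixed)
```

The second pattern also escapes the dot in `\.html`, so it only matches a literal period there.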

wpcarro · Accepted Answer · 2016-07-03T19:29:41.000

2

If you're only parsing for links, and you're familiar with the content you will be parsing, the following regex should help you accomplish what you're after and is pretty safe.

import re

regex = re.compile(r'href="([^"]+)')
results = regex.findall(content)  # content holds the HTML text to search

`href="` matches the literal characters href=" without capturing them.
`([^"]+)` matches and captures one or more characters that aren't a double quote, so `findall` returns just the URL.

Run a few trials with the content you are scraping and assess whether you need more specificity in the regex or not.
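
A trial run along those lines might look like the sketch below. It assumes `content` is the two sample strings from the question; `href_pattern` and `links` are illustrative names only.

```python
import re

# Sample strings copied from the question; a stand-in for the scraped page.
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"\n'
    '<a href="/abcabcdef/coffee/su1/"'
)

href_pattern = re.compile(r'href="([^"]+)')

# With a single capturing group, findall returns the captured URL for each match.
links = href_pattern.findall(content)
print(links)
```

Note that this pattern picks up every `href="..."` in the text, including ones on tags other than `<a>` (for example `<link>`), so the results may need filtering depending on the page.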

edited Jul 03 '16 at 19:29

answered Jul 03 '16 at 18:04

wpcarro


You are using `re.findall`. `r'href="([^"]+)'` is enough. – Wiktor Stribiżew Jul 03 '16 at 18:11
@WiktorStribiżew indeed it is. Good catch. I'll modify the answer above. – wpcarro Jul 03 '16 at 18:16

score 1 · Answer 2 · edited May 23 '17 at 12:08


Use an HTML parser like BeautifulSoup:

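from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

soup = BeautifulSoup(content, "html.parser")

# Collect the href of every <a> tag that has one
print([a["href"] for a in soup.find_all("a", href=True)])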

Don't use a regex to parse HTML.
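
If installing a third-party module is a concern, the same parser-based idea can be sketched with the standard library's `html.parser` instead. This is a substitute for BeautifulSoup, not what this answer uses; the class name `LinkCollector` and the sample `content` string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):  # hypothetical helper, not part of any library
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Assumed sample input: one anchor with an href and one without.
content = '<p><a href="/abcabcdef/coffee/su1/">coffee</a> <a name="x">no link</a></p>'

collector = LinkCollector()
collector.feed(content)
collector.close()
print(collector.links)  # ['/abcabcdef/coffee/su1/']
```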

edited May 23 '17 at 12:08

Community


answered Jul 03 '16 at 17:42

Padraic Cunningham


This requires adding an additional module, BeautifulSoup, to the project. I understand that there may be better tools to parse HTML than regular expressions. But this question is asking about extracting web links using regular expressions. So while your answer works and is elegant, it seems to side-step what's being asked. – wpcarro Jul 03 '16 at 17:46
@wcarroll, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 You should not use regex to parse HTML. There is no side-stepping of what is being asked; it is the correct approach to what is essentially being asked. – Padraic Cunningham Jul 03 '16 at 17:48
I almost included in my comment "yes I have seen the infamous SO post". I guess I should have been explicit. This doesn't change my comment above. If he is only parsing small strings that contain HTML, regular expressions are fit for the task and I think preferable to including a third-party module and learning its API. – wpcarro Jul 03 '16 at 17:50
@wcarroll, where does it say they are *parsing small strings that contain HTML*, *I am trying to extra web links from web content* seems pretty clear that they are parsing the full content returned. I am not going to encourage anyone to parse html with a regex and anyone that does is leading someone down a bad path – Padraic Cunningham Jul 03 '16 at 17:52
But he is not parsing the whole HTML to build a DOM either. Isn't it OK to look for HTTP URIs in it? If not, why exactly? – M. Timtow Jul 03 '16 at 17:58
@M.Timtow, the correct way to parse html is with a html parser just like the correct way to parse xml is with an xml parser, if you are a bit unclear as to why then read the linked answer. – Padraic Cunningham Jul 03 '16 at 18:06
@PadraicCunningham Yes, I think I found an explanation here: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – M. Timtow Jul 03 '16 at 18:07
But still, to my knowledge, you cannot nest HTML tags indefinitely inside an HTTP URI after href= – M. Timtow Jul 03 '16 at 18:09
@M.Timtow, what if the OP wants particular links from a certain part of the html that are nested inside a whole lot more html, can you imagine how that regex would look and how brittle it would be? The bottom line is don't parse html with a regex unless you want broken code. A lot of people only use a regex because they don't know of any alternative methods. – Padraic Cunningham Jul 03 '16 at 18:11
@PadraicCunningham In this case, you are right, it seems impossible: http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg (the phone number example) – M. Timtow Jul 03 '16 at 18:16
