What's wrong with my Regex string for scraping link elements

Question

I'm having a small problem with a VB.NET scraper. It is supposed to extract all links from an HTML string that I have already downloaded. The links are there (I have checked), so the problem must be in my regex string.

My regex string: <a.*?href=""(.*?)"".*?>(.*?)</a>

This works for some sites, but for others it does not.

Here are examples from the HTML source that match and don't match.

Working:

<a href="http://domain.com" rel="nofollow" onmousedown="return clk('25936','3')" target="_blank"></a>

Not working:

<a href='http://domain.com' target="_blank" ><font size=2><b>text</b></a>

Could it be because one link uses " and the other uses ' around the href value?

How are you using the regex? Why are there two double quotes? – Tushar, Sep 21 '16 at 07:45
I am not sure how your regex matches the first example. (What tool/language are you using?) You can try [this](https://regex101.com/r/xO1iQ0/1) out to play around with your regexes. – Kamehameha, Sep 21 '16 at 07:49
*[he comes, he comes](http://stackoverflow.com/a/1732454/1667004)* – ppeterka, Sep 21 '16 at 07:49
`Well two quotes next to each other means "` yes, but you have `""`, and in your sample HTML that will match nothing since you have content inside the quotes. – VLAZ, Sep 21 '16 at 07:52

score 2 · Accepted Answer · answered Sep 21 '16 at 07:54

Try the following regular expression:

<a.*?href=["'](.*?)["'].*?><\/a>

Your pattern writes the double quote twice (""), and it only allows double quotes around the href value. Since an a tag's href can be written with either single or double quotes, the pattern has to accept both.
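To see the quote handling on the two links from the question, here is a minimal sketch in Python rather than VB.NET. The pattern is a variant of the one above and is my own assumption, not the exact regex from this answer: it uses a backreference so the closing quote must be the same character as the opening one, and it keeps the capture group for the link text. The names `LINK_RE` and `extract_links` are made up for this illustration.

```python
import re

# Hypothetical variant of the pattern above (not the answer's exact regex).
# Group 1 is the opening quote, group 2 the href value, group 3 the inner HTML.
# The backreference \1 requires the closing quote to match the opening one.
LINK_RE = re.compile(
    r"""<a\b[^>]*?href\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return (href, inner_html) pairs for every anchor the pattern matches."""
    return [(m.group(2), m.group(3)) for m in LINK_RE.finditer(html)]


# The two links from the question: one double-quoted, one single-quoted.
samples = [
    '''<a href="http://domain.com" rel="nofollow" onmousedown="return clk('25936','3')" target="_blank"></a>''',
    '''<a href='http://domain.com' target="_blank" ><font size=2><b>text</b></a>''',
]

for sample in samples:
    print(extract_links(sample))
```

The same pattern syntax (lazy quantifiers, character classes, the `\1` backreference) is also supported by the .NET regex engine; inside a VB.NET string literal each double quote has to be written as "". This sketch does not handle unquoted href values, and for arbitrary real-world pages an HTML parser is generally more robust than a regex.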

Modi Ranga Nayakulu

Thank you, it worked great! – Anders Sep 21 '16 at 07:58
