Using RegEx in VB.Net to extract Download Link

Question

I have a web page that can contain links in the following format.

<a href='/documents/aso2v51_1bk.pdf' target="_blank">Ordering Model – Access Service Volume II – Analysis</a>

The page can contain zero or more links of this type. I would like to extract the href path and the title of the document. I want to avoid doing this in VB.NET with Split and hope there is a simple solution using a regex.

In the end, the simplest fix is to use an HTML parser like [HTML Agility Pack](https://stackoverflow.com/q/846994/1115360). — Andrew Morton, Jun 18 '18 at 19:30
You may use the following expression: `(?<=href=)(?<href>.*')(?:.*?(?=>)>)(?<title>.*(?=<))`. Live regex [here](https://regex101.com/r/S1F1fJ/2). The two capture groups are named `href` and `title` respectively. Does this help? — Paolo, Jun 18 '18 at 19:31
Yes, that pretty much does it. I made a small change, so it now looks like this: `(?<=href='\/documents\/)(?<href>.*')(?:.*?(?=>)>)(?<title>.*(?=<))`. My question: how can I eliminate the `'` at the end of the href part in the regex? — MisterniceGuy, Jun 18 '18 at 19:51
@MisterniceGuy I've updated my answer with your new requirements, check my answer below. — Paolo, Jun 18 '18 at 20:21

score 0 · Accepted Answer · answered Jun 18 '18 at 20:21

You can use the following expression:

(?<=href='\/documents\/)(?<href>.*(?='))(?:.*?(?=>)>)(?<title>.*(?=<))

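(?<=href='\/documents\/) Positive lookbehind: the match must be preceded by href='/documents/, which is not itself part of the match.
(?<href>.*(?=')) Named capture group href: matches everything up to, but not including, the last " ' " on the line.
(?:.*?(?=>)>) Non-capturing group: matches everything up to and including the next " > ".
(?<title>.*(?=<)) Named capture group title: matches everything up to, but not including, the last " < " on the line.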

Live regex here.

Group href : aso2v51_1bk.pdf

Group title : Ordering Model – Access Service Volume II – Analysis
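
To show how the expression might be used from VB.NET, here is a minimal sketch. It assumes the page source is already in a string and that each link sits on its own line, since the greedy `.*` parts would otherwise run across several links. The module name, the constant name, and the second sample link (`example.pdf`) are hypothetical and only there for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LinkExtractor
    ' Pattern from the answer above. VB.NET string literals do not treat
    ' the backslash as an escape character, so it can be pasted unchanged.
    Private Const LinkPattern As String =
        "(?<=href='\/documents\/)(?<href>.*(?='))(?:.*?(?=>)>)(?<title>.*(?=<))"

    Sub Main()
        ' Hypothetical sample input: one link per line (assumption, see above).
        Dim html As String =
            "<a href='/documents/aso2v51_1bk.pdf' target=""_blank"">Ordering Model – Access Service Volume II – Analysis</a>" & vbLf &
            "<a href='/documents/example.pdf' target=""_blank"">Example Title</a>"

        ' Regex.Matches returns an empty collection when the page contains
        ' no matching links, so the "0 or more" case needs no special handling.
        For Each m As Match In Regex.Matches(html, LinkPattern)
            Console.WriteLine("href:  " & m.Groups("href").Value)
            Console.WriteLine("title: " & m.Groups("title").Value)
        Next
    End Sub
End Module
```

The named groups are read with `m.Groups("href")` and `m.Groups("title")`, which avoids any manual splitting of the string. If several links can appear on one line, an HTML parser such as the one suggested in the comments is likely the more robust choice.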
