I have a web page that can contain links in the following format:

<a href='/documents/aso2v51_1bk.pdf' target="_blank">Ordering Model – Access Service Volume II – Analysis</a>

The page can contain zero or more links of this type. I would like to extract the href path and the title of the document. I want to avoid doing this in VB.NET with Split, and I hope there is a simple solution using a regex.

MisterniceGuy
  • In the end, the simplest fix is to use an HTML parser like [HTML Agility Pack](https://stackoverflow.com/q/846994/1115360). – Andrew Morton Jun 18 '18 at 19:30
  • You may use the following expression: `(?<=href=)(?<href>.*')(?:.*?(?=>)>)(?<title>.*(?=<))`. Live regex [here](https://regex101.com/r/S1F1fJ/2). The two capture groups are named `href` and `title` respectively. Does this help? – Paolo Jun 18 '18 at 19:31
  • Yes, that pretty much does it. I made a small change, so it now looks like this: `(?<=href='\/documents\/)(?<href>.*')(?:.*?(?=>)>)(?<title>.*(?=<))`. My question: how can I eliminate the `'` at the end of the href part in the regex? – MisterniceGuy Jun 18 '18 at 19:51
  • @MisterniceGuy I've updated my answer with your new requirements, check my answer below. – Paolo Jun 18 '18 at 20:21

1 Answer

You can use the following expression:

(?<=href='\/documents\/)(?<href>.*(?='))(?:.*?(?=>)>)(?<title>.*(?=<))
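  • (?<=href='\/documents\/) Positive lookbehind: the match must start right after href='/documents/, which is not itself part of the match.
  • (?<href>.*(?=')) Named capture group href: greedily match everything up to, but not including, the last ' on the line.
  • (?:.*?(?=>)>) Non-capturing group: lazily match everything up to and including the next > .
  • (?<title>.*(?=<)) Named capture group title: greedily match everything up to, but not including, the last < on the line.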

Group href: aso2v51_1bk.pdf

Group title: Ordering Model – Access Service Volume II – Analysis
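
Since the question asks about VB.NET, here is a minimal sketch of how the expression above might be applied with System.Text.RegularExpressions. The module name and the html variable are made up for illustration, and the input is simply the sample link from the question. In real use the page source would be loaded from wherever it lives.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ExtractLinksSketch ' hypothetical name, for illustration only
    Sub Main()
        ' Sample input taken from the question. Inner double quotes are doubled in VB string literals.
        Dim html As String = "<a href='/documents/aso2v51_1bk.pdf' target=""_blank"">Ordering Model – Access Service Volume II – Analysis</a>"

        ' The expression from this answer, unchanged.
        Dim pattern As String = "(?<=href='\/documents\/)(?<href>.*(?='))(?:.*?(?=>)>)(?<title>.*(?=<))"

        ' Regex.Matches returns an empty collection when the page has no such links,
        ' so the loop body is simply skipped in that case.
        For Each m As Match In Regex.Matches(html, pattern)
            Console.WriteLine("href:  " & m.Groups("href").Value)
            Console.WriteLine("title: " & m.Groups("title").Value)
        Next
    End Sub
End Module
```

Because . does not match a newline by default, each link is matched within its own line. This sketch assumes at most one such link per line, as in the sample.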

Paolo