
I have a large block of HTML in which I want to find href links. Each href is in a table that also contains info about the href. I need to extract an href only if the info in that row contains two digits followed by a space and a hyphen (`\d{2} -`).

I will use the regex in C#.

I tried this regex:

((a|link).*?href=(\"|')(.+?)(\"|').*?)

But that captures way too much...

This is part of the html:

<td style="vertical-align: top; width: 375px"><h1 id="H_67e23f29-bb69-46eb-b15c">00 - <a class="Hyperlink" href="http://xxxx.xxxxxxx.com/Management/HyperlinkLoader.aspx?HyperlinkID=c64b7052-4229-4169-b8a2" class="Hyperlink" onmouseover="HyperlinkLoader.showTooltip(event,'c64b7052-4229-4169-b8a2')" onclick="HyperlinkLoader.followHyperlink(event,{'HyperlinkID':'c64b7052-4229-4169-b8a2','ReturnURL':''});return false">Algemeen</a></h1></td>
<td style="width: 292px; vertical-align: top; background-color: rgb(255, 255, 255); text-align: left"><a class="Hyperlink" href="http://xxxx.xxxxxxx.com/Management/HyperlinkLoader.aspx?HyperlinkID=29517117-26ce-4004-88ed" class="Hyperlink" onmouseover="HyperlinkLoader.showTooltip(event,'29517117-26ce-4004-88ed')" onclick="HyperlinkLoader.followHyperlink(event,{'HyperlinkID':'29517117-26ce-4004-88ed','ReturnURL':''});return false">Asbestbeheersplan</a></td>
<td style="vertical-align: top; width: 375px"><h1 id="H_f534b7b2-a8f5-41b0-9aa8">01 - <a class="Hyperlink" href="http://xxxx.xxxxxxx.com/Management/HyperlinkLoader.aspx?HyperlinkID=7af55197-d865-4cb2-bb9c" class="Hyperlink" onmouseover="HyperlinkLoader.showTooltip(event,'7af55197-d865-4cb2-bb9c')" onclick="HyperlinkLoader.followHyperlink(event,{'HyperlinkID':'7af55197-d865-4cb2-bb9c','ReturnURL':''});return false">Voor werken geldende voorwaarden</a></h1></td>

And this is what I would like to capture:

match 1 (href for "00 - Algemeen"):
http://xxxx.xxxxxxx.com/Management/HyperlinkLoader.aspx?HyperlinkID=c64b7052-4229-4169-b8a2

match 2 (href for "01 - Voor werken geldende voorwaarden"):
http://xxxx.xxxxxxx.com/Management/HyperlinkLoader.aspx?HyperlinkID=7af55197-d865-4cb2-bb9c
dreojs16

1 Answer


You need to use the regex lookaround features (lookahead and lookbehind).

I tried this regex and found it worked well on the sample:

(?<=Hyperlink" href=")((http)(.+?)(\"|').*?)

More info can be found here.
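
As a minimal sketch of how the lookbehind idea could also enforce the `\d{2} - ` condition from the question: .NET allows variable-length lookbehind, so the whole `NN - <a ... href="` prefix can be required without becoming part of the match. The class name `LookbehindHrefs` and the shortened sample HTML are hypothetical, and the pattern assumes the digits sit directly before the `<a` tag, as in the posted rows.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class LookbehindHrefs
{
    // Lookbehind: two digits, " - ", an opening <a tag, then href= and a quote.
    // [^>]*? keeps the prefix inside a single tag. The match itself is the URL.
    private static readonly Regex Pattern = new Regex(
        @"(?<=\d{2} - <a\b[^>]*?\bhref=[""'])[^""']+");

    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened stand-in for the rows in the question (assumed data).
        string html =
            "<td><h1>00 - <a class=\"Hyperlink\" href=\"http://example.com/?id=1\" onclick=\"x\">A</a></h1></td>" +
            "<td><a class=\"Hyperlink\" href=\"http://example.com/?id=2\">B</a></td>" +
            "<td><h1>01 - <a class=\"Hyperlink\" href=\"http://example.com/?id=3\">C</a></h1></td>";

        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

On the sample above this should print only the `id=1` and `id=3` URLs, because the second link has no two-digit prefix. A URL that itself contains a quote character would be cut short by `[^"']+`.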

Ygalbel
    Lookarounds only set the context for the match here, but the problem is with the two non-greedy `.*?` / `.+?` that are separated with a specific pattern. The first one needs tempering. – Wiktor Stribiżew Apr 21 '20 at 07:42
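
One possible way to temper the first quantifier, sketched under the same assumptions as the question's sample: instead of `.*?`, consume only characters that are not `>` and that do not start `href=`, so the pattern cannot wander out of the `<a` tag or skip past its own `href`. The class name `TemperedHrefs`, the group names `q` and `url`, and the shortened sample HTML are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class TemperedHrefs
{
    // \d{2} - <a          two digits, " - ", then the opening tag
    // (?:(?!href=)[^>])*  tempered token: stay inside the tag, stop at href=
    // (?<q>["'])          opening quote, reused as the closing quote via \k<q>
    // (?<url>.*?)         the URL, up to the matching closing quote
    private static readonly Regex Pattern = new Regex(
        @"\d{2} - <a\b(?:(?!href=)[^>])*href=(?<q>[""'])(?<url>.*?)\k<q>");

    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            result.Add(m.Groups["url"].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened stand-in for the rows in the question (assumed data).
        string html =
            "<td><h1>00 - <a class=\"Hyperlink\" href=\"http://example.com/?id=1\" onclick=\"x\">A</a></h1></td>" +
            "<td><a class=\"Hyperlink\" href=\"http://example.com/?id=2\">B</a></td>" +
            "<td><h1>01 - <a class=\"Hyperlink\" href=\"http://example.com/?id=3\">C</a></h1></td>";

        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Here the URL is read from the named group `url` rather than from the whole match, so the leading `NN - <a ...` context and the surrounding quotes are left out of the result.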