Regex to catch links with special codes

Question

I have a regex that is close but not quite there:

(https?)://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,\<\>]*)

It's supposed to catch links with special codes in them before they get replaced. Here's an example text:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#></a>. Some text afterwards

Another example:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#first#>&querystring2=<#another#>&querystring3=foo&querystring4=<#bar#></a>

Or even just "plain" links:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=foo&querystring2=bar</a>

I want to capture all these links, without the tags, and some links contain the delimiters.

According to the tester, it's close, but it keeps catching the closing a tag at the end AND the period. I get why, I just don't know how to fix it. In my example, I need to catch the <#specialcode#> and any number of other query strings after it. Without too many details, the <# and #> are delimiters in an application. Any help here would be appreciated.

I took the root regex from here: "Get url from a text". I've tried testing it here: http://www.regextester.com/

bruh... regex for this? Use `ParseQueryString` http://stackoverflow.com/a/6082703/940217 — Kyle Falconer, Jun 22 '16 at 17:36
@KyleFalconer How does that find and extract a link from mass of text? — Rawling, Jun 22 '16 at 17:37
I take it that you are not actually using `#` unencoded in the parameters - it is a reserved character: [Characters allowed in GET parameter](http://stackoverflow.com/a/1455639/1115360). — Andrew Morton, Jun 22 '16 at 18:11
@AlexeiLevenkov: Waiting for the [pony](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), I guess :) — Jan, Jun 22 '16 at 18:26
@AndrewMorton No, they get replaced by the app, which I tried to clarify in the question. — archangel76, Jun 22 '16 at 18:50

Kyle Falconer · Accepted Answer · 2016-06-22T20:52:22.430

Assuming the input text isn't a proper HTML document, and assuming you just want to extract the URL together with its query-string parameters, this regex will do it:

(https?:\/\/[^?<]+)[?]?(([^=<]+)=(<#[^&<]*#>|[^&<]*)&?)*

This is based on the following test inputs:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#></a>. Some text afterwards
some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#>&querystring2=foo</a>. Some text afterwards
some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#>&querystring2=foo&querystring3=<#specialcode2#></a>. Some text afterwards
some leading text <a>http://subsite.domain.com/somepage.aspx</a>. Some text afterwards

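The results would be in the capturing groups: group 1 is the URL up to the query string, and groups 3 and 4 are the name and value of a query-string parameter. The overall match is the full link without the surrounding tags.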
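As a rough sketch of how the pattern could be applied, here is a Python version (Python is used only for illustration; the names `LINK_RE`, `PARAM_RE` and `extract_links` are made up for this example). One caveat: Python's `re` module keeps only the last iteration of a repeated group, so the sketch takes the whole match as the link and re-scans the query string with the inner part of the same pattern to list every parameter.

```python
import re

# The pattern from this answer, written as a raw string.
LINK_RE = re.compile(r"(https?:\/\/[^?<]+)[?]?(([^=<]+)=(<#[^&<]*#>|[^&<]*)&?)*")

# The inner name=value part of the same pattern, used to list all pairs.
PARAM_RE = re.compile(r"([^=<]+)=(<#[^&<]*#>|[^&<]*)&?")


def extract_links(text):
    """Return (full_link, base_url, params) for each link found in text."""
    results = []
    for match in LINK_RE.finditer(text):
        full_link = match.group(0)   # whole link, tags excluded
        base_url = match.group(1)    # everything before the '?'
        query = full_link[len(base_url):].lstrip("?")
        params = PARAM_RE.findall(query)  # list of (name, value) tuples
        results.append((full_link, base_url, params))
    return results


sample = (
    "some leading text <a>http://subsite.domain.com/somepage.aspx"
    "?querystring1=<#specialcode#>&querystring2=foo</a>. Some text afterwards"
)

for full_link, base_url, params in extract_links(sample):
    print(full_link)
    print(base_url)
    print(params)
```

The match stops at the `<` of the closing tag because every character class in the pattern excludes `<`, while the `<#...#>` alternative lets the delimiters through explicitly.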

If the given text were an HTML document, then the regex would have to change, because instead of the link being inside <a>http://linkhere.com</a>, it would be in the href attribute: <a href="http://linkhere.com">link here</a>
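For that href case, a minimal sketch might look like the following. It assumes the attribute value is always wrapped in double quotes and never contains a double quote itself; `HREF_RE` and `extract_hrefs` are hypothetical names, and Python is again used only for illustration.

```python
import re

# Hypothetical pattern: capture the value of a double-quoted href on an <a> tag.
# [^"]* lets the <# ... #> delimiters through, since only a quote ends the value.
HREF_RE = re.compile(r'<a\s[^>]*?href="(https?://[^"]*)"', re.IGNORECASE)


def extract_hrefs(html):
    """Return the href values of all <a> tags whose URL is http or https."""
    return HREF_RE.findall(html)


html = '<a href="http://linkhere.com/page.aspx?q=<#code#>">link here</a>'
print(extract_hrefs(html))
```

Because the closing quote delimits the value here, the pattern no longer needs to treat `<` as the end of the link.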
