I have a regex that is close but not quite there:

(https?)://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,\<\>]*)

It's supposed to catch links with special codes in them before they get replaced. Here's an example text:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#></a>. Some text afterwards

Another example:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#first#>&querystring2=<#another#>&querystring3=foo&querystring4=<#bar#></a>

Or even just "plain" links:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=foo&querystring2=bar</a>

I want to capture all of these links without the surrounding tags, including the links that contain the delimiters.

According to the tester, it's close, but it keeps catching the closing a tag at the end AND the period. I get why, I just don't know how to fix it. In my example, I need to catch the <#specialcode#> and any number of other query strings after it. Without too many details, the <# and #> are delimiters in an application. Any help here would be appreciated.

I took the root regex from here: "Get url from a text". I've tried testing it here: http://www.regextester.com/

archangel76

1 Answer

Assuming the input text isn't a proper HTML document, and assuming you're just looking to extract the url and query strings and parameters, this regex will do it:

(https?:\/\/[^?<]+)[?]?(([^=<]+)=(<#[^&<]*#>|[^&<]*)&?)*

This is based on the following test inputs:

some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#></a>. Some text afterwards
some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#>&querystring2=foo</a>. Some text afterwards
some leading text <a>http://subsite.domain.com/somepage.aspx?querystring1=<#specialcode#>&querystring2=foo&querystring3=<#specialcode2#></a>. Some text afterwards
some leading text <a>http://subsite.domain.com/somepage.aspx</a>. Some text afterwards

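The results are in the capturing groups: group 1 holds the URL up to the query string, and groups 3 and 4 hold a parameter name and its value. Because those two groups sit inside a repeated group, most regex engines keep only the last name/value pair in them; the full match (group 0) still spans the whole link, without the closing tag or the period.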
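As a rough illustration, here is a minimal Python sketch that runs the pattern over inputs like the ones above. The names `extract_links`, `LINK_RE` and `PAIR_RE` are made up for this example, and `PAIR_RE` is an extra helper pattern that is not part of the regex above: Python's `re` keeps only the last iteration of a repeated group, so the sketch re-scans the matched query string to list every parameter.

```python
import re

# The pattern from this answer, unchanged. Group 1 is the URL up to the query string.
LINK_RE = re.compile(r"(https?:\/\/[^?<]+)[?]?(([^=<]+)=(<#[^&<]*#>|[^&<]*)&?)*")

# Hypothetical helper (an assumption, not from the answer): one name=value pair.
PAIR_RE = re.compile(r"([^=&<]+)=(<#[^&<]*#>|[^&<]*)")


def extract_links(text):
    """Return (full_link, base_url, [(name, value), ...]) for each link in text."""
    results = []
    for match in LINK_RE.finditer(text):
        full, base = match.group(0), match.group(1)
        # Everything after the base URL is the query string, minus the leading '?'.
        query = full[len(base):].lstrip("?")
        results.append((full, base, PAIR_RE.findall(query)))
    return results


sample = (
    "some leading text <a>http://subsite.domain.com/somepage.aspx"
    "?querystring1=<#specialcode#>&querystring2=foo</a>. Some text afterwards"
)
for full, base, params in extract_links(sample):
    print(full)
    print(base)
    print(params)
```

In an engine that records every iteration of a repeated group (.NET exposes them through `Group.Captures`), the second pass should not be needed.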

If the given text were an HTML document, then the regex would have to change, because instead of the link being inside <a>http://linkhere.com</a>, it would be in the href attribute: <a href="http://linkhere.com">link here</a>
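For that HTML case, one option is to skip the regex for locating links and let a parser read the href attribute. The sketch below swaps in Python's standard `html.parser` module for that step; `HrefCollector` and `extract_hrefs` are hypothetical names for this example, and it has only been reasoned through for simple, well-formed anchors like the one above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


print(extract_hrefs('<a href="http://linkhere.com">link here</a>'))
```

Each extracted href could then be split into its base URL and parameters with a pattern like the one above.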

Kyle Falconer