
I have 4 different kinds of links, each carrying other attributes besides href. I need to match 3 of them and capture just the text of the link.

In short: the first 3 links need to be matched by their href attribute, and the text between the tags needs to be captured.

<a href="https://example.com/page_url" data-some-id="" data-other-prop="">Link 1</a>
<a data-href="" href="http://Go to page" data-another-id="">Link 2</a>
<a data-other="" href="/Go to page" data-val-id="">Link 3</a>
<a href="http://example123.com/page" data-props-id="">Link 4</a>

Regex needs to match:

  • URLs that either contain 'example.com' (link 1 example), or
  • Links that don't contain a domain (link 2 example), or
  • Links with no scheme, e.g. http (link 3 example)
  • Non-href attributes can have different names, so 'data-', 'style="' and other properties can either be before or after href.
  • It needs to be specific to the anchor (<a>) tag

The 4th link shouldn't be captured. It will always have a different domain from link 1 (example.com).

I've made plenty of attempts over the past 2 days but can't get it right. The pipe (regex OR) combined with '.*' and a negative lookahead trips me up every time, e.g.

<a.*(?:example\.com|(?!href="http?.*([\s])))+".*>(.*)<\/a>

It seems to be tougher than it looks to get the required match.

Note: this is for response HTML held in a string, and the matching happens before it's applied to the DOM. So jQuery and DOM-related solutions are out of the question, sorry. Progressive capturing using multiple expressions is welcome.

Dmitriy Kravchuk
  • [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - use a parser – ctwheels Jan 26 '18 at 17:48
  • Are you saying I should crunch down every anchor to `href=".*"` and, with the matches in a loop, go from there? That's definitely an acceptable solution. I guess I will answer my own question with that approach. Thought this might be a bit much to ask for on Stack Overflow. – Dmitriy Kravchuk Jan 26 '18 at 17:57
  • It's a string, as stated in the "Note". I can't use the DOM before sending it as a response. It's a JS-based API. Loops and IFs it is for me then. – Dmitriy Kravchuk Jan 26 '18 at 18:09
  • You can make a string a DOM element. See [this SO post](https://stackoverflow.com/questions/3103962/converting-html-string-into-dom-elements) for more info – ctwheels Jan 26 '18 at 18:12
  • Thanks for the suggestion @ctwheels. It's definitely something to keep in mind for UI scripts and HTML string responses. But this script will also be used on a server, which doesn't support DOM-related functions. Or at least I ran into syntax issues in my previous attempts. – Dmitriy Kravchuk Jan 26 '18 at 18:38
  • When you say server-side, which language are you using? – ctwheels Jan 26 '18 at 18:45
  • It's called SuiteScript, which is an extended version of JavaScript specific to its online solution. – Dmitriy Kravchuk Jan 26 '18 at 18:48
  • You should tag your post with that tag. It may help to find someone who's knowledgeable in SuiteScript and can help you write a proper parser. While a regex solution *may* work, it's **definitely not** the best answer – ctwheels Jan 26 '18 at 18:50
  • Thanks for the suggestion, but a regex solution was already in place (it just had to be modified). No API or other changes were required, so the tags are fine. In my case it's the most efficient way to replace the tags with varying HTML snippets that don't apply to the current page; it's readable and easy to maintain, and it doesn't cause reliability issues between server and UI. The snippets are small enough to pinpoint issues quickly, and this eliminates the need for IFs for each scenario. But I will definitely look into the possibility of parsing for future server scripts. – Dmitriy Kravchuk Jan 26 '18 at 19:42
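
For reference, the loop-based approach discussed in the comments (grab every anchor's href first, then decide in plain code) might look roughly like the sketch below. It is only an illustration, not something posted in the thread: the names `anchorPattern`, `isWantedHref` and `wantedLinkTexts` are made up, and it assumes double-quoted href values with no `>` inside attribute values. It is written in ES5 style in case the server runtime is an older one.

```javascript
// The four sample anchors from the question, joined into one HTML string.
var html = [
  '<a href="https://example.com/page_url" data-some-id="" data-other-prop="">Link 1</a>',
  '<a data-href="" href="http://Go to page" data-another-id="">Link 2</a>',
  '<a data-other="" href="/Go to page" data-val-id="">Link 3</a>',
  '<a href="http://example123.com/page" data-props-id="">Link 4</a>'
].join('\n');

// Step 1: capture the href value (group 1) and the link text (group 2) of
// every anchor. The \s required before href keeps data-href from matching.
var anchorPattern = /<a\s(?:[^>]*?\s)?href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/g;

// Step 2: decide in plain code whether an href is one of the wanted kinds.
function isWantedHref(href) {
  // Split off "scheme://host"; the host stops at the first /, ? or #.
  var m = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]*)/i.exec(href);
  if (m === null) {
    return true; // no scheme at all (link 3)
  }
  var host = m[1].toLowerCase().replace(/:\d+$/, ''); // drop a port, if any
  if (/(^|\.)example\.com$/.test(host)) {
    return true; // example.com or one of its subdomains (link 1)
  }
  return host.indexOf('.') === -1; // no dot, so not a real domain (link 2)
}

function wantedLinkTexts(input) {
  var texts = [];
  var match;
  anchorPattern.lastIndex = 0; // reset, because the pattern carries the g flag
  while ((match = anchorPattern.exec(input)) !== null) {
    if (isWantedHref(match[1])) {
      texts.push(match[2]);
    }
  }
  return texts;
}

console.log(wantedLinkTexts(html)); // → [ 'Link 1', 'Link 2', 'Link 3' ]
```

Keeping the href rules in a small function means a new case only needs one more condition, rather than another branch inside a single large pattern.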

1 Answer

<a ?.*? href="(?:(?:.*?example\.com.*?)|(?:[^\.]*?))".*?>(.*?)<\/a>

This seems to work fine with the examples you've given. It's similar to what you wrote in your attempt but makes use of lazy quantifiers to prevent matching the unwanted stuff.
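
As a quick check, here is a minimal sketch that runs the pattern above over the four sample links from the question. The helper name `captureLinkTexts` is only illustrative, and the `g` flag is added so that every match is returned rather than just the first.

```javascript
// The four sample anchors from the question, one per line.
var html = [
  '<a href="https://example.com/page_url" data-some-id="" data-other-prop="">Link 1</a>',
  '<a data-href="" href="http://Go to page" data-another-id="">Link 2</a>',
  '<a data-other="" href="/Go to page" data-val-id="">Link 3</a>',
  '<a href="http://example123.com/page" data-props-id="">Link 4</a>'
].join('\n');

// The pattern from this answer, plus the g flag to iterate over all matches.
var linkPattern = /<a ?.*? href="(?:(?:.*?example\.com.*?)|(?:[^\.]*?))".*?>(.*?)<\/a>/g;

function captureLinkTexts(input) {
  var texts = [];
  var match;
  linkPattern.lastIndex = 0; // reset, because the pattern carries the g flag
  while ((match = linkPattern.exec(input)) !== null) {
    texts.push(match[1]); // group 1 is the text between the tags
  }
  return texts;
}

var captured = captureLinkTexts(html);
console.log(captured); // → [ 'Link 1', 'Link 2', 'Link 3' ]
```

One caveat: `.` does not match a newline, so each match here stays on its own line. If several anchors share a single line, the lazy `.*?` can still run across tag boundaries, and a match may then start at an unwanted anchor and end at a wanted one.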

Example in action and full explanation.

BobbitWormJoe