I have 4 different kinds of links, each with other attributes. I need to match 3 of them and capture just the text of the link.
In short: the first 3 links need to be matched by their href attribute, and the text between the tags captured.
<a href="https://example.com/page_url" data-some-id="" data-other-prop="">Link 1</a>
<a data-href="" href="http://Go to page" data-another-id="">Link 2</a>
<a data-other="" href="/Go to page" data-val-id="">Link 3</a>
<a href="http://example123.com/page" data-props-id="">Link 4</a>
The regex needs to match:
- URLs that contain 'example.com' (link 1), or
- URLs that don't contain a domain (link 2), or
- URLs that have no scheme, e.g. http (link 3)
- Non-href attributes can have different names, so 'data-', 'style="' and other properties can either be before or after href.
- It needs to be specific to the anchor (<a>) tag
The 4th link shouldn't be captured. It will always have a different domain from link 1 (example.com).
I've made plenty of attempts over the past 2 days but can't get it right. Generally the pipe (regex OR) combined with '.*' and a negative lookahead gets me every time, e.g.
<a.*(?:example\.com|(?!href="http?.*([\s])))+".*>(.*)<\/a>
It seems to be tougher than it looks to get the required match.
Note: this is for response HTML held in a string, and the matching happens before it's applied to the DOM. So jQuery and DOM-related solutions are out of the question, sorry. Progressive capturing using multiple expressions is welcome.
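To make "progressive capturing" concrete, here is a rough sketch of the behaviour described above. It is an illustration under several assumptions, not a definitive solution. JavaScript is assumed because jQuery is mentioned. The names `captureLinkTexts` and `isWanted` are hypothetical. Attribute values are assumed to be double-quoted and to contain no `>`. Subdomains of example.com are assumed to be acceptable. "No domain" is taken to mean that the host part is not a dotted hostname. Hrefs such as `mailto:` are not handled specially.

```javascript
// Assumption: subdomains such as www.example.com should also be accepted.
const ALLOWED_DOMAIN_RE = /(?:^|\.)example\.com$/i;
// Assumption: a "domain" is a dotted hostname made of letters, digits and hyphens.
const DOMAIN_RE = /^[a-z0-9-]+(?:\.[a-z0-9-]+)+$/i;

// Hypothetical helper: decides whether an href value meets the three rules.
function isWanted(href) {
  // Capture the host part if the href starts with "scheme://" or "//".
  const m = /^(?:[a-z][a-z0-9+.-]*:)?\/\/([^\/?#]*)/i.exec(href);
  if (!m) return true;                      // no scheme (link 3)
  const host = m[1];
  if (!DOMAIN_RE.test(host)) return true;   // scheme but no domain (link 2)
  return ALLOWED_DOMAIN_RE.test(host);      // only example.com passes (link 1)
}

// Hypothetical helper: returns the text of every matching <a> in an HTML string.
function captureLinkTexts(html) {
  // Step 1: capture the attribute string and the inner text of each anchor.
  const anchorRe = /<a\b([^>]*)>([\s\S]*?)<\/a>/gi;
  // Step 2: extract href from the attributes; requiring the start of the
  // string or whitespace before "href" keeps data-href from matching.
  const hrefRe = /(?:^|\s)href\s*=\s*"([^"]*)"/i;
  const texts = [];
  for (const [, attrs, text] of html.matchAll(anchorRe)) {
    const hrefMatch = hrefRe.exec(attrs);
    if (hrefMatch && isWanted(hrefMatch[1])) texts.push(text);
  }
  return texts;
}

const sample = [
  '<a href="https://example.com/page_url" data-some-id="" data-other-prop="">Link 1</a>',
  '<a data-href="" href="http://Go to page" data-another-id="">Link 2</a>',
  '<a data-other="" href="/Go to page" data-val-id="">Link 3</a>',
  '<a href="http://example123.com/page" data-props-id="">Link 4</a>',
].join('\n');

console.log(captureLinkTexts(sample)); // [ 'Link 1', 'Link 2', 'Link 3' ]
```

Splitting the work into "find the anchor", "find the href", and "classify the href" avoids combining alternation, `.*`, and a negative lookahead in a single expression, which is where the attempt above runs into trouble.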