I need your help :(
What I want:
Match the string if url.text AND url.href both contain a URL and the two URLs are not equal (ignoring protocol and subdomains).

It should work like this:

<a href="http://www.test1.net/dir1/index.html" target="_blank">test1.net/admin</a> <-- NOT MATCH
<a href="https://test2.com">THIS SITE</a> <-- NOT MATCH
<a href="https://subdomain.test3.org">test2.org</a> <-- MATCH
<a href="http://www2.test4.com" target="_blank">https://global.test4.com/index.html</a> <-- NOT MATCH
<a href="http://eu.test5.com">https://evil.com/eu.test5.com/</a> <-- MATCH
<a href="http://eu.site6.com/index.html" target="_blank">https: // eu. evil. com</a> <-- MATCH
<a href="https://site7.com/">http://www.site7.com/123/test</a> <-- NOT MATCH

I started writing something like this, but my code does the opposite of what I need.
Help me figure out how to get the behavior I want.

refregerator
  • [Regex is not the best fit to use on HTML](https://stackoverflow.com/q/1732348/479156). Can't you use an HTML parser instead? – Ivar Sep 14 '19 at 15:15
  • I can help with the regex. But you'd have to explain what this means `I had a problem with my code doing the opposite`. Show some specific examples of what you _DO_ and _DO NOT_ want to match and _WHY_ –  Sep 14 '19 at 17:36
  • @Ivar No, I can't use anything except RegEx :( – refregerator Sep 14 '19 at 19:05
  • @sin The code which I shared marks strings if url.text and url.href are equal. I don't need this. I need to mark unequal things as I wrote in 'code' section above. – refregerator Sep 14 '19 at 19:05
  • @refrigerator - I actually requested some exact examples _with_ explanations, since your examples seem to contradict each other. And, although we want to help, nobody wants to waste their time... –  Sep 14 '19 at 21:33

1 Answer

Your original expression is pretty well designed, but I would add a negative lookahead such as:

(?!.*\1.*)

or:

(?!((?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?(\1)).*)

inside it, to reject link text (url.text) that contains the same domain as the href, for example with an expression similar to:

(?i)<a\s+href="(?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?([a-z0-9_-]+\.[a-z0-9_-]{2,6})(\/[^"]*)?"[^>]*>(?!.*\1.*)(?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?([a-z0-9_-]+\.[a-z0-9_-]{2,6})(\/[^"]*)?.*?<\/a>

or, probably more accurately, with:

(?i)<a\s+href="(?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?([a-z0-9_-]+\.[a-z0-9_-]{2,6})(\/[^"]*)?"[^>]*>(?!((?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?(\1)).*)(?:https?:\s*\/\/\s*)?(?:\s*w{3}\.\s*)?(?:[^"\/]*\.\s*)?([a-z0-9_-]+\s*\.\s*[a-z0-9_-]{2,6}\s*)(\/[^"]*)?.*?<\/a>

You will most likely want to modify these and adjust the boundaries. For instance, you can add \s* anywhere you want to allow spaces, or use a bounded quantifier such as \s{0,5} to cap how many are allowed.
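To make the idea easier to test outside regex101, here is a minimal Python sketch of the same technique: capture the href domain in group 1, then use a negative lookahead with the \1 backreference to reject link text that starts with that domain. It is a simplified variant, not the exact expression above. The names HREF, NOT_SAME_DOMAIN, TEXT_URL and MISMATCH are mine, and it assumes the domain is always the last two host labels with a 2 to 6 letter TLD (so something like example.co.uk would be handled wrongly).

```python
import re

# href part: group 1 captures the last two host labels, e.g. "test3.org".
# The TLD must be followed by "/" or the closing quote, which keeps the
# capture from sliding onto an earlier pair of labels when backtracking.
HREF = r'<a\s+href="https?://(?:[^"/]*\.)?([a-z0-9-]+\.[a-z]{2,6})(?:/[^"]*)?"[^>]*>'

# Negative lookahead: fail if the link text starts with
# [protocol][subdomains.]<the domain captured from href>.
NOT_SAME_DOMAIN = r'(?!(?:https?://)?(?:[^<"/]*\.)?\1(?![\w.-]))'

# The link text must itself start with something URL-like.
# The \s* pieces tolerate spaced-out text such as "https: // eu. evil. com".
TEXT_URL = r'(?:https?:\s*//\s*)?(?:[a-z0-9-]+\s*\.\s*)+[a-z]{2,6}[^<]*</a>'

MISMATCH = re.compile(HREF + NOT_SAME_DOMAIN + TEXT_URL, re.IGNORECASE)

# The seven sample anchors from the question.
SAMPLES = [
    '<a href="http://www.test1.net/dir1/index.html" target="_blank">test1.net/admin</a>',
    '<a href="https://test2.com">THIS SITE</a>',
    '<a href="https://subdomain.test3.org">test2.org</a>',
    '<a href="http://www2.test4.com" target="_blank">https://global.test4.com/index.html</a>',
    '<a href="http://eu.test5.com">https://evil.com/eu.test5.com/</a>',
    '<a href="http://eu.site6.com/index.html" target="_blank">https: // eu. evil. com</a>',
    '<a href="https://site7.com/">http://www.site7.com/123/test</a>',
]

for html in SAMPLES:
    label = "MATCH    " if MISMATCH.search(html) else "NOT MATCH"
    print(label, html)
```

On these seven samples it reproduces the MATCH / NOT MATCH labels from the question. One known gap: the lookahead itself does not tolerate spaces, so spaced-out text that hides the same domain (for example "eu. site6. com") would still be reported as a mismatch.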

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.


Emma
  • Awesome! But the third match doesn't work correctly. You're deleting the spaces between words in url.text, but I need them kept. [DEMO](https://regex101.com/r/fZi1HH/2) – refregerator Sep 14 '19 at 20:59