So, I have a JavaScript regexp like this:

/url:.?(['"])(https?:\/\/.*?)\1/

I use it to find a specific URL inside HTML/JS code. As you can see, I capture the link inside either '' or "". This is a problem, because I don't want to get links like 'http://'.

/url:.?(['"])(https?:\/\/.+)\1/

This also picks stuff like 'http://"+d+', also bad.

I'd like to be able to say in the regex something like this:

/(['"])(https?:\/\/[^\1]+)\1/

That is, I want to use [^\1] instead of a dot, so that it only gets whatever is inside '' or "" and does not pick up 'http://"+d+'.

Is there a way to do stuff like this?

CrayonViolent
Alex L.
  • are you saying you want to match for the url inside the quotes, but only if it is a valid url format? – CrayonViolent Nov 17 '16 at 14:24
  • As neither the apostrophe `'` nor the quote character `"` is valid in a URL (among other characters), you can just use `[^"']` (simplified). See http://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid for all allowed characters. – Paul Nov 17 '16 at 14:28
  • I'm not sure what you are actually trying to accomplish, but to answer your question about equivalent of `[^\1]+`, you can do this: `(['"])((?:(?!\1).)+)` – CrayonViolent Nov 17 '16 at 14:28
  • I feel you actually do not need the backreference, just use the negated character class - [`/(['"])(https?:\/\/[^'"]+)\1/`](https://regex101.com/r/Z10mGg/2). – Wiktor Stribiżew Nov 17 '16 at 14:30
  • Is this in the browser and you are searching through the InnerHTML of something? – Tomalak Nov 17 '16 at 14:30
  • Hey, yes, this is code intended for the browser, which first goes through my node.js app. I am trying to find valid URLs, but I decode them from the strict URL format, so they can actually have ' or " inside, as part of the query string. – Alex L. Nov 17 '16 at 14:38
  • The [`(['"])((?:(?!\1).)+)`](https://regex101.com/r/q5wed4/1) does not seem to be what is necessary here. – Wiktor Stribiżew Nov 17 '16 at 14:42
  • @AlexL.: Any feedback? Did my comment help? – Wiktor Stribiżew Nov 17 '16 at 22:07
  • Hey, no, not really. It was all helpful, but apparently this specific case is not possible in plain regex. Gone with [^'"]. – Alex L. Nov 21 '16 at 09:42
  • @AlexL. I see, but see my answer below now. It actually resolves the issue you stated in the question. – Wiktor Stribiżew May 06 '20 at 08:56
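
The `(?:(?!\1).)+` construct suggested in the comments can be tried out as below. This is only a sketch: the sample inputs and the helper name `extractWithLookahead` are made up for illustration, and the pattern is the comment's idea combined with the `https?:\/\/` prefix from the question.

```javascript
// Each character is consumed only if it is not the opening quote (group 1),
// which is the working equivalent of the intended [^\1]+.
const temperedUrl = /(['"])(https?:\/\/(?:(?!\1).)+)\1/;

// Hypothetical helper: returns the captured URL, or null when nothing matches.
function extractWithLookahead(source) {
  const match = temperedUrl.exec(source);
  return match ? match[2] : null;
}

// A decoded URL that contains the *other* quote type is still captured whole.
console.log(extractWithLookahead(`url: 'http://example.com/?q="x"'`));
// An empty 'http://' is rejected because + needs at least one character.
console.log(extractWithLookahead(`url: 'http://'`));
// Caveat: the concatenation fragment from the question is still accepted.
console.log(extractWithLookahead(`url: 'http://"+d+'`));
```

As the last call shows, this variant allows the other quote inside the URL, so it does not by itself filter out 'http://"+d+'.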

1 Answer

Note that [^\1] matches any char but a \x01 char (SOH, start of heading). That is because inside character classes, \ + digit cannot define a backreference. See ECMAScript reference:

Inside a CharacterClass, \b means the backspace character, while \B and backreferences raise errors.

In practice, as you have seen, JS implementations treat \1 inside a [...] class as an octal escape rather than raising an error (see Using special characters).
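
A quick way to check this behaviour yourself is the sketch below. The variable names and sample strings are only illustrative, and the second half assumes an engine that supports the `u` flag, where the legacy octal form is not allowed.

```javascript
// Without the u flag, \1 inside [...] is a legacy octal escape for U+0001,
// not a reference to capture group 1.
const notSoh = /[^\1]/;
console.log(notSoh.test("'"));    // true: a quote is "not U+0001"
console.log(notSoh.test("\x01")); // false: U+0001 itself is excluded

// So the closing quote is not excluded and gets swallowed by the class.
const naive = /(['"])[^\1]+/;
const naiveMatch = naive.exec("'abc'")[0];
console.log(naiveMatch);          // 'abc' including both quotes

// With the u flag the same class is rejected when the pattern is compiled.
let unicodeModeError = null;
try {
  new RegExp("(['\"])[^\\1]+", "u");
} catch (e) {
  unicodeModeError = e;
}
console.log(unicodeModeError instanceof SyntaxError); // true
```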

In your case, you just want to match any char but ' and " with [^'"]; you do not need to check which quote character was matched previously:

/(['"])(https?:\/\/[^'"]+)\1/

See the regex demo

Details

  • (['"]) - Group 1: a ' or "
  • (https?:\/\/[^'"]+) - Group 2: http, an optional s, ://, 1 or more chars other than ' and "
  • \1 - the value of Group 1.
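
To see how the pattern behaves on the cases from the question, here is a small sketch. The helper name `extractUrl` and the sample inputs are made up for illustration; only the regex itself comes from the answer.

```javascript
// The pattern from the answer: same quote on both sides, no quotes in between.
const urlInQuotes = /(['"])(https?:\/\/[^'"]+)\1/;

// Hypothetical helper: returns the captured URL (group 2), or null on no match.
function extractUrl(source) {
  const match = urlInQuotes.exec(source);
  return match ? match[2] : null;
}

console.log(extractUrl(`url: 'http://example.com/a'`));     // http://example.com/a
console.log(extractUrl(`url: "https://example.com/b?x=1"`)); // https://example.com/b?x=1
console.log(extractUrl(`url: 'http://'`));                   // null
console.log(extractUrl(`url: 'http://"+d+'`));               // null
console.log(extractUrl(`url: 'http://example.com"`));        // null (quotes differ)
```

The trade-off is that a URL containing a literal ' or " is not matched at all.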
Wiktor Stribiżew