I have a regular expression that matches URLs in a string that are not between quotes. It works well, but I have a minor issue with it.

The part that deals with the quotes is capturing the first character (which can also be whitespace) before the URL (usually before https).

Here is the regex:

/(?:^|[^"'])(ftp|http|https|file):\/\/[\S]+(\b|$)/gim

You can test it and see this unwanted character appear at the start of the match, in front of the URL (provided there is anything in front of the URL, of course).

How do I get the proper full match?

Lahey
  • Huh? Nothing before the `ftp` is capturing anything. If you mean *matching*, then yes, it's doing that. – Aran-Fey Sep 08 '18 at 19:49
  • You just need to wrap what you need to get with a capturing group and extract that group - `/(?:^|[^"'])((?:ftp|https?|file):\/\/\S+)(?:\b|$)/gim` and grab `match[1]`. Is that JavaScript? – Wiktor Stribiżew Sep 08 '18 at 19:50
  • @WiktorStribiżew I believe this is working. Group 1 is the proper match. Thank you! Do you have an explanation on why this unwanted character is appearing? – Lahey Sep 08 '18 at 19:53
  • Because the non-capturing group `(?:^|[^"'])` is matching the char other than `'` and `"` with `[^'"]` negated character class. It consumes that char, so it is added to the whole match value. – Wiktor Stribiżew Sep 08 '18 at 19:54
  • Ah yes of course! My knowledge regarding regex is limited as you may have guessed but it's good to know, thanks again! – Lahey Sep 08 '18 at 19:56

1 Answer

The non-capturing group `(?:^|[^"'])` matches and consumes a character other than `'` and `"` via the negated character class `[^'"]`. Because that character is consumed, it becomes part of the whole match value. What a non-capturing group does not do is store the matched substring in a separate memory buffer, so you cannot access that substring on its own after a match is found.
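To see the consumed character in practice, here is a minimal sketch that runs the pattern from the question unchanged. It assumes JavaScript (the question never confirms the language), and the sample text and the `example.com` URL are made up for illustration.

```javascript
// Pattern from the question, unchanged.
const re = /(?:^|[^"'])(ftp|http|https|file):\/\/[\S]+(\b|$)/gim;

// Hypothetical input: a URL preceded by a space.
const text = 'see https://example.com now';

const m = re.exec(text);

// m[0] is the whole match. It starts with the space that [^"'] consumed.
const fullMatch = m[0];
// m[1] is the first capturing group, which only wraps the scheme.
const scheme = m[1];

console.log(JSON.stringify(fullMatch)); // " https://example.com"
console.log(scheme);                    // https
```

The leading space belongs to the whole match because the group consumed it, yet no group exposes the URL by itself, since the only capturing groups wrap the scheme and the trailing boundary.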

The usual solutions are:
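One of these, the approach suggested in the comments above, is to wrap the URL part in a capturing group and read group 1 instead of the whole match. The sketch below assumes JavaScript with `String.prototype.matchAll` available; `extractUrls` and the sample input are hypothetical names and data used only for illustration.

```javascript
// Regex from the comments: group 1 wraps the URL, while the leading
// (?:^|[^"']) still consumes the character in front of it.
const re = /(?:^|[^"'])((?:ftp|https?|file):\/\/\S+)(?:\b|$)/gim;

// Hypothetical helper: collect group 1 of every match.
function extractUrls(text) {
  const urls = [];
  for (const m of text.matchAll(re)) {
    urls.push(m[1]); // m[0] would include the preceding character
  }
  return urls;
}

// Hypothetical input: two bare URLs and one URL inside double quotes.
const urls = extractUrls(
  'go to https://example.com or "http://quoted.example" or ftp://files.example.org/pub'
);

console.log(urls); // [ 'https://example.com', 'ftp://files.example.org/pub' ]
```

The pattern only inspects the single character directly before the scheme, so the URL in double quotes is skipped here because a `"` immediately precedes `http`.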

Wiktor Stribiżew