1

I have a bit of a strange one here: I have a large chunk of text which may or may not contain links to images.

So let's say it does. I have a pattern which extracts the image URL fine; however, once a match is found it is replaced with an <img> element with the link as the src. Now the problem is that there may be multiple matches within the text, and this is where it gets tricky: the URL pattern will now also match the URL in the src attribute, which just results in an infinite loop.

So is there a way to ONLY match in regex if the match isn't preceded by a pattern like ="|=' ? Then it would match the URL in something like:

some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6

but not

some image <img src="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6">

I am not sure if it is possible, but if it is, could someone point me in the right direction? A replace by itself will not suffice in this scenario, as the matched URL needs to be used elsewhere too, so it needs to be available as a capture.

The main scenarios I need to account for are:

  • Many links in one block of varied text
  • A single link without any other text
  • A single link with other varied text

== edit ==

Here is the current regex I am using to match urls:

(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))

== edit 2 ==

Just so everyone understands why I cannot use the /g flag, here is an answer which explains the issue. If I could use /g like I originally tried, it would make things a lot simpler.

Javascript regex multiple captures again

Grofit
  • Have you tried using the `/g` flag, which should do a single global replace, rather than having to loop through until a match is "not found"? – freefaller Sep 27 '13 at 09:37
  • In JavaScript it doesn't seem to work; there is some problem with multiple captures and exec, so you need to loop round until no matches remain. I read something about JS not supporting captures or multiple matches in a single result, although if you can prove the above in a jsfiddle or something I will happily give you the answer, as I could never get it to work. – Grofit Sep 27 '13 at 09:40
  • Why is there a downvote to the question, this is a well defined question given the constraints and the scenario. – Grofit Sep 27 '13 at 09:52
  • [try this jQuery based jsfiddle](http://jsfiddle.net/EnfRb/)... although it does highlight that the query string part of the URL isn't taken into account. If you want vanilla JS, [try this jsfiddle](http://jsfiddle.net/yKmKH/) – freefaller Sep 27 '13 at 09:57

4 Answers

3

What you are looking for is a negative lookbehind, but JavaScript didn't support lookbehinds of any kind before ES2018, so in engines without them you will either have to use a callback function to check what was matched and make sure it is not preceded by a ' or ", or you can use the following regex:

(?:^|[^"'])(\b(https?|ftp|file):\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))

The one drawback is that a successful match also consumes one extra character: the one right before the (\b(https?|ftp|file) pattern in the input. The URL itself is still available in capture group 1, so I think you can deal with this easily.
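
To make the extra-character point concrete, here is a minimal sketch of looping with exec over the pattern above. The helper name findBareImageUrls and the example.com URLs are made up for illustration, and the g and i flags are my addition so that exec can resume from the previous match:

```javascript
// Pattern from the answer above; the g and i flags are added for this sketch
// so that exec() resumes after the previous match on each call.
var re = /(?:^|[^"'])(\b(https?|ftp|file):\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/gi;

// Hypothetical helper: collects every image URL that is not directly
// preceded by a single or double quote.
function findBareImageUrls(text) {
  var urls = [];
  var match;
  re.lastIndex = 0; // clear any state left over from an earlier call
  while ((match = re.exec(text)) !== null) {
    // match[0] may start with the one extra leading character;
    // match[1] is the URL on its own.
    urls.push(match[1]);
  }
  return urls;
}

var sample = 'a http://example.com/a.png and <img src="http://example.com/b.jpg"> and http://example.com/c.gif';
var found = findBareImageUrls(sample);
var alone = findBareImageUrls('http://example.com/only.bmp');
```

Reading match[1] instead of match[0] is what sidesteps the extra character, and the ^ alternative covers the case where the link is the only content.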

Regex101 Demo
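
For completeness, JavaScript engines that implement ES2018 do accept lookbehind assertions, so where such an engine can be assumed the negative lookbehind can be written directly. A minimal sketch, with a hypothetical variable name bareUrl and made-up example.com URLs:

```javascript
// Assumes an engine with ES2018 lookbehind support; older engines reject
// this pattern with a syntax error.
// (?<!=["']) fails the match when the URL is directly preceded by =" or ='.
var bareUrl = /(?<!=["'])\b((?:https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/gi;

var html = 'x http://example.com/a.png <img src="http://example.com/b.jpg">';

// With the g flag, match() returns the full text of every match.
var bare = html.match(bareUrl);
```

Because a lookbehind consumes no characters, the match here is exactly the URL, with no extra leading character to strip.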

Ibrahim Najjar
  • This seems to work and addresses the question's context slightly better, as the other answers, which are very useful, are less about tackling the pattern at the start and more about changing tack to get the replace to work in one go. – Grofit Sep 27 '13 at 10:14
1

Using the /ig flags at the end should work... the g is for global replace and the i is for case-insensitivity, which is necessary as you've only got A-Z instead of a-zA-Z.

Using the following vanilla JS appears to work for me (see jsfiddle)...

var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");

What it does highlight, though, is that the query string part of the URL (the ?v=6) is not being picked up by your RegEx.

For jQuery, it would be (see jsfiddle)...

$(document).ready(function(){
  var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
  var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
  $("#output").html(test.replace(re,"<img src=\"$1\"/>"));
});

Update

Just in case using the same image URL three times in the example doesn't convince you - it also works with different URLs... see this jsfiddle update

var test="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 http://cdn.sstatic.net/serverfault/img/sprites.png?v=7";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");
freefaller
  • Interesting, although the replace works how do you actually access the underlying match so you can make use of the captures data when doing it this way? – Grofit Sep 27 '13 at 10:07
  • That's a good question @Grofit, and I'm sorry but I'm simply not aware of how you'd do that. The replace is based on simple pattern matching... if you need to do explicit processing on each individual match then I **believe** (but am happy to be proved wrong) that you would have to do individual matches. If I'm right, I think there is a way to call an external function, but I've never done it and cannot give any advice in that direction... sorry! – freefaller Sep 27 '13 at 10:11
  • That is fine buddy. If the question was simply about doing the replace then you would get the answer given JavaScript's limitations; however, as the match still needs to be used outside of the replace, I have given the answer to the other chap, but upvoted as I'm sure this would be the more applicable answer for most people doing similar. – Grofit Sep 27 '13 at 10:16
  • @Grofit, not a problem fella, but it wasn't clear from your OP that you needed the ability to do extra processing on those matches. Good luck with the rest of your project :-) – freefaller Sep 27 '13 at 10:17
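
On the point raised in the comments about getting at the captures during a global replace: String.prototype.replace accepts a function as its second argument and passes it the full match followed by each capture group. A minimal sketch, assuming the question's pattern; the helper name replaceAndCollect and the example.com URLs are made up for illustration:

```javascript
// The question's pattern with the i and g flags, as used in this answer.
var re = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;

// Hypothetical helper: swaps each image URL for an <img> tag and also
// returns the list of URLs that were matched.
function replaceAndCollect(text) {
  var urls = [];
  // The callback receives (whole match, group 1, group 2, offset, input).
  var html = text.replace(re, function (whole, url) {
    urls.push(url); // keep the capture for use elsewhere
    return '<img src="' + url + '"/>';
  });
  return { html: html, urls: urls };
}

var result = replaceAndCollect("one http://example.com/a.png two http://example.com/b.gif?v=6");
```

Since replace makes a single pass over the original string, the URLs it writes into the src attributes are never scanned again, so the looping problem from the question does not arise. As noted in the answer, the ?v=6 query string is left behind after the tag.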
0

Couldn't you just check for whitespace in front of the URL, instead of that word boundary? It seems to work, although you will have to remove the matched whitespace later.

(\s(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))

http://rubular.com/r/9wSc0HNWas

Edit: Damn, too slow :) I'll still leave this here as my regex is shorter ;)

Tomke
  • What if the text was just a link, which had no whitespace before it? In that case it would not work :( – Grofit Sep 27 '13 at 10:00
  • That's true, I did not know you expected something like this... Would you expect something like: here is some texthttp://.... ? – Tomke Sep 27 '13 at 10:17
  • Nah, that is not too much of a worry as it's a rare case and too hard to test for. It was mainly just the case of a link being posted as the sole content which I wanted to point out, but you are right, it was not specifically mentioned in the question. – Grofit Sep 27 '13 at 12:13
0

As freefaller said, you might use the /g flag to find all matches in one go, if exec is not a must.

Otherwise, you can add (="|=')? to the beginning of your regex and check whether $1 is undefined. If it is undefined, the match was not preceded by a ="|=' pattern.
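
A minimal sketch of that second suggestion, assuming the question's pattern with the optional prefix group added in front. The helper name findUnquotedUrls and the example.com URLs are made up for illustration. Note that the prefix becomes group 1, so the URL moves to group 2:

```javascript
// Question's pattern with an optional (="|=') group prepended.
// Group 1 is the optional prefix, group 2 is the URL.
var re = /(="|=')?(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;

// Hypothetical helper: returns only the URLs not preceded by =" or ='.
function findUnquotedUrls(text) {
  var urls = [];
  var match;
  re.lastIndex = 0; // clear any state left over from an earlier call
  while ((match = re.exec(text)) !== null) {
    // An optional group that did not participate in the match is undefined.
    if (match[1] === undefined) {
      urls.push(match[2]);
    }
  }
  return urls;
}

var text = "http://example.com/a.png <img src='http://example.com/b.jpg'> <img src=\"http://example.com/c.gif\">";
var unquoted = findUnquotedUrls(text);
```

Because the regex has the g flag, each exec call continues after the previous match, so a URL inside a src attribute is consumed together with its prefix and skipped rather than matched a second time.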

GuiDocs
  • the reason I cannot use the /g is explained here in the answer: http://stackoverflow.com/questions/14707360/javascript-regex-multiple-captures-again – Grofit Sep 27 '13 at 10:00
  • my answer works even if `exec` is a must, but you could just use `match` or `replace` – GuiDocs Sep 27 '13 at 10:18