3

If I have a string like this:

var str = "play the Ukulele in Lebanon. play the Guitar in Lebanon.";

I want to get the strings between each pair of the substrings "play" and "in", so basically an array containing "the Ukulele" and "the Guitar".

Right now I'm doing:

var test = str.match("play(.*)in");

But that returns the string between the first "play" and the last "in", so I get "the Ukulele in Lebanon. play the Guitar" instead of 2 separate strings. Does anyone know how to globally search a string for all occurrences of a substring between a starting and an ending string?

Tushar
MarksCode

4 Answers

9

You can use the regex

play\s*(.*?)\s*in

  1. Use / as the delimiters of a regex literal, and add the g flag so that every occurrence is found
  2. Use a lazy quantifier (.*?) so the group matches as little as possible

Demo:

var str = "play the Ukulele in Lebanon. play the Guitar in Lebanon.";
var regex = /play\s*(.*?)\s*in/g;

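var matches = [];
var m; // declared explicitly to avoid creating an implicit global

// With the g flag, each exec() call resumes from regex.lastIndex
while ((m = regex.exec(str)) !== null) {
  matches.push(m[1]); // m[1] holds the first capture group
}

document.body.innerHTML = '<pre>' + JSON.stringify(matches, null, 4) + '</pre>';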
Tushar
3

You are so close to the right answer. There are a few things you may be overlooking:

  1. You need your match to be non-greedy; this can be accomplished by adding ? after the quantifier
  2. String.match() does not give you what you want here: without the g flag it returns only the first match, and with the g flag it returns the full matched text of each occurrence and drops the capturing groups. An alternative is RegExp.exec() or String.replace(), but replace would require a little more work, so stick to building your own array with exec

var str     = "display the Ukulele in Lebanon. play the Guitar in Lebanon.";
var re      = /\bplay (.+?) in\b/g;
var matches = [];
var match;

while ( match = re.exec(str) ){
  matches[ matches.length ] = match[1];
}


document.getElementById('demo').innerHTML = JSON.stringify( matches );
<pre id="demo"></pre>
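
As a side note to the exec loop above, here is a minimal sketch of the same idea using String.prototype.matchAll, assuming a runtime that supports it (ES2020 or later). The helper name captureGroups is hypothetical and only used for illustration; it is not part of the original answer.

```javascript
// Hypothetical helper: returns capture group 1 of every match.
// matchAll() throws a TypeError if the pattern lacks the g flag.
function captureGroups(text, pattern) {
  return Array.from(text.matchAll(pattern), function (m) {
    return m[1];
  });
}

var sample = "display the Ukulele in Lebanon. play the Guitar in Lebanon.";
var pattern = /\bplay (.+?) in\b/g;

// match() with the g flag keeps only the full matched text
console.log(sample.match(pattern));          // [ 'play the Guitar in' ]

// matchAll() keeps the capture groups of each match
console.log(captureGroups(sample, pattern)); // [ 'the Guitar' ]
```

Only "the Guitar" is returned because the \b before "play" rejects the "play" inside "display".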
vol7ron
  • Thank you, sir, this is an excellent answer. Another user gave me the regex `/play\s*(.*?)\s*in/g`, but yours looks much simpler. The syntax looks pretty messy, so I'm still trying to understand it. – MarksCode Feb 26 '16 at 06:07
  • I was busy typing and didn't notice that @Tushar came to almost the same answer, except for the value assignment to the array. In JavaScript you can use ` `, `\s`, or `\ ` to match a space. Just be careful elsewhere, like Perl, where the ` ` could be ignored. Also, `\s` matches more than just a space character; it also matches a tab or a newline character. – vol7ron Feb 26 '16 at 06:11
  • @vol7ron: I found some possible issues in your expression. I referenced it in my answer. – Jon Mar 01 '16 at 03:23
  • @Jon thanks, you're correct, this could use word boundaries. Keep in mind that even word boundaries could have issues with hyphenations. The most robust solution would require many more lines of logic - or a negative lookbehind (which I don't think ECMAScript RegEx permits). So this also requires the OP to be more specific about the string(s) being evaluated. That said, the `\b` would be a good thing to include. – vol7ron Mar 01 '16 at 03:39
  • @vol7ron: Yes, `\b` can have issues with many special characters. It's possible that I'm making more of this than needs to be as the string the OP is dealing with may vary little from what he provided above, in which case `\b` would be unnecessary. Also, his question may have been really just about greedy vs lazy. But I suppose that while the issue of potential problems with `\b` has been raised (and as you alluded, without knowing more about possible variation in his input string), maybe the following would be safer: `/(?:\s|^)play\s+(.+?)\s+in\s/ig`. – Jon Mar 01 '16 at 04:17
2

/\bplay\s+(.+?)\s+in\b/ig might be more specific and might work better for you.

I believe there may be some issues with the regexes offered previously. For instance, /play\s*(.*?)\s*in/g will find a match within "displaying photographs in sequence". Of course this is not what you want. One of the problems is that there is nothing specifying that "play" should be a discrete word. It needs a word boundary before it and at least one instance of white space after it (it can't be optional). Similarly, the white space after the capture group should not be optional.

The other expression offered at the time I added this, /play (.+?) in/g, lacks the word boundary token before "play" and after "in", so it will contain a match in "display blue ink". This is not what you want.

As to your expression, it was missing the word boundary and white space tokens as well. But as another mentioned, it also needed the wildcard to be lazy. Otherwise, given your example string, your match would start with the first instance of "play" and end with the 2nd instance of "in".

If you find issues with my expression, I would appreciate feedback.
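
The false positives described above can be checked with a short sketch. The helper name collect and the variable names are hypothetical, chosen only for this illustration; the three patterns are the ones quoted in this answer.

```javascript
// Hypothetical helper: runs a global regex over text and
// returns capture group 1 of every match.
function collect(pattern, text) {
  var out = [];
  var m;
  pattern.lastIndex = 0; // start from the beginning on every call
  while ((m = pattern.exec(text)) !== null) {
    out.push(m[1]);
  }
  return out;
}

var loose  = /play\s*(.*?)\s*in/g;        // optional white space, no boundaries
var spaced = /play (.+?) in/g;            // literal spaces, no boundaries
var strict = /\bplay\s+(.+?)\s+in\b/ig;   // boundaries and required white space

console.log(collect(loose,  "displaying photographs in sequence")); // [ '' ]
console.log(collect(spaced, "display blue ink"));                   // [ 'blue' ]
console.log(collect(strict, "displaying photographs in sequence")); // []
console.log(collect(strict, "display blue ink"));                   // []
console.log(collect(strict, "play the Ukulele in Lebanon. play the Guitar in Lebanon."));
// [ 'the Ukulele', 'the Guitar' ]
```

The loose pattern matches "playin" inside "displaying" with an empty capture, which is why the first result is an array holding one empty string.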

Jon
0

A victim of greedy matching.

.* finds the longest possible match,

while .*? finds the shortest possible match.

For the example given, a lazy pattern with the global flag, such as /play\s*(.*?)\s*in/g, captures 2 strings:

    the Ukulele
    the Guitar
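
To make the greedy versus lazy difference concrete, here is a small sketch on the string from the question. The variable names are illustrative only, and both patterns are used without the g flag so that match() returns the capture group of the first match.

```javascript
var text = "play the Ukulele in Lebanon. play the Guitar in Lebanon.";

// Greedy: .* runs to the end of the string, then backtracks to the LAST "in"
var greedyCapture = text.match(/play(.*)in/)[1];

// Lazy: .*? stops at the FIRST "in" that lets the whole pattern match
var lazyCapture = text.match(/play(.*?)in/)[1];

console.log(JSON.stringify(greedyCapture)); // " the Ukulele in Lebanon. play the Guitar "
console.log(JSON.stringify(lazyCapture));   // " the Ukulele "
```

The surrounding spaces remain in both captures because these two patterns do not consume white space around the group.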
Arif Burhan