6

I have this regex to extract double words from text

/[A-Za-z]+\s[A-Za-z]+/g

And this sample text

Mary had a little lamb

My output is this

[0] - Mary had; [1] - a little;

Whereas my expected output is this:

[0] - Mary had; [1] - had a; [2] - a little; [3] - little lamb

How can I achieve this output? As I understand it, the index of the search moves to the end of the first match. How can I move it back one word?

  • FYI: `\w+` matches a word. Easier and more foolproof than `[a-zA-Z]+` :-) – Florian Margaine Dec 29 '12 at 13:49
  • Since there is overlapping portion, the regex must have look-ahead to avoid consuming the input, and you must capture the text matched inside the look-ahead. `split` won't work, since JS split function ignores capturing group, which is necessary to pick out the overlapping portion (some language like C# or Ruby will include the captured text). `match` also won't work, since it will ignore capturing groups with `g` flag. Not sure if there is any other way that lets you work with regex. – nhahtdh Dec 29 '12 at 13:50
  • @FlorianMargaine: `\w` and `[a-zA-Z]` are totally different things. `\w`, written as a character class, is `[a-zA-Z0-9_]`: it matches the English alphabet, digits and the underscore `_`. Whether it is foolproof or not depends on the OP's requirements. – nhahtdh Dec 29 '12 at 13:51
  • @nhahtdh OP requirements are "extract double words". Clearly more foolproof for his requirements. And yes, I over simplified a bit, but the end remains the same: he was looking for `\w+`. – Florian Margaine Dec 29 '12 at 13:54
  • @FlorianMargaine: "Word" can be defined differently. I can say: `234` or `234_sdfk` or `sdf_sdfs` are one word. But I can also say they are not one word. Again, it depends on requirement. – nhahtdh Dec 29 '12 at 13:56
  • 1
    People seem to be completely unaware of [`RegExp.lastIndex`](https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp/lastIndex). – katspaugh Dec 29 '12 at 13:57
  • You shouldn't do it with regexps, but just for fun: https://tinker.io/9e819 (see [here](http://chat.stackoverflow.com/transcript/message/6916588#6916588)) – Zirak Dec 29 '12 at 14:12

6 Answers

6

Abusing String.replace function

I use a little trick with the replace function. Since replace loops through the matches and lets us supply a callback function, we can use it to collect them. The result will be in output.

var output = [];
var str = "Mary had a little lamb";
str.replace(/[A-Za-z]+(?=(\s[A-Za-z]+))/g, function ($0, $1) {
    output.push($0 + $1);
    return $0; // Actually we don't care. You don't even need to return
});

Since the output contains overlapping portions of the input string, we must not consume the next word while matching the current word. Look-ahead takes care of that 1.

The regex /[A-Za-z]+(?=(\s[A-Za-z]+))/g does exactly that: it consumes only one word at a time with the [A-Za-z]+ portion (the start of the regex), looks ahead for the next word with (?=(\s[A-Za-z]+)) 2, and captures the text matched inside the look-ahead.

The function passed to replace receives the matched string as the first argument and the captured text in subsequent arguments. (There are more, check the documentation, but I don't need them here.) Since the look-ahead is zero-width (the input is not consumed), the whole match is conveniently just the first word. The captured text from the look-ahead goes into the 2nd argument.

Proper solution with RegExp.exec

Note that String.replace incurs the overhead of building a replacement string, even though that result is never used. If this is unacceptable, you can rewrite the above code with RegExp.exec in a loop:

var output = [];
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;
var arr;

while ((arr = re.exec(str)) != null) {
    output.push(arr[0] + arr[1]);
}
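As an alternative to the look-ahead loop above, the question's idea of "moving the index back one word" can be sketched directly with RegExp.lastIndex. This is only a minimal sketch, assuming the original consuming regex with one extra capturing group around the second word; the names text, pairRe and rewound are made up for the example.

```javascript
var text = "Mary had a little lamb";
// Original regex from the question, plus a capturing group around the second word
var pairRe = /[A-Za-z]+\s([A-Za-z]+)/g;
var rewound = [];
var m;

while ((m = pairRe.exec(text)) !== null) {
    rewound.push(m[0]);
    // Step back to the start of the second word so it can begin the next match.
    // The search still moves forward overall, so the loop terminates.
    pairRe.lastIndex -= m[1].length;
}

console.log(rewound); // ["Mary had", "had a", "a little", "little lamb"]
```

The rewind only works because the regex has the g flag; without it, exec ignores lastIndex and the loop would never end.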

Footnote

  1. In other regex flavors that support variable-width look-behind, it is possible to retrieve the previous word, but JavaScript regex did not support look-behind when this was written (it was added later, in ES2018).

  2. (?=pattern) is syntax for look-ahead.

Appendix

String.match can't be used here since it ignores capturing groups when the g flag is used. The capturing group is necessary in the regex, since we need look-around to avoid consuming input while still matching overlapping text.
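On newer runtimes there is a way around this limitation: String.prototype.matchAll (ES2020) keeps the capturing groups for every match even with the g flag. The following is a minimal sketch that assumes such a runtime; the variable names are made up for the example.

```javascript
// Assumes an ES2020+ runtime where String.prototype.matchAll exists.
var sentence = "Mary had a little lamb";
var lookaheadRe = /[A-Za-z]+(?=(\s[A-Za-z]+))/g; // matchAll requires the g flag

// Each item m is a match array: m[0] is the consumed word,
// m[1] is the text captured inside the look-ahead.
var allPairs = Array.from(sentence.matchAll(lookaheadRe), function (m) {
    return m[0] + m[1];
});

console.log(allPairs); // ["Mary had", "had a", "a little", "little lamb"]
```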

nhahtdh
  • 1
    A comparison of all the methods available here: http://jsperf.com/split-overlapping-regex. **Please read the note before running the test!!!** If the input space is not constrained to only alphabet and space, the output can be different. – nhahtdh Dec 29 '12 at 20:38
4

It can be done without regexp

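"Mary had a little lamb".split(" ")
    .map(function (item, idx, arr) {
        // Pair each word with its successor; the last word yields undefined
        if (idx < arr.length - 1) {
            return item + " " + arr[idx + 1];
        }
    })
    .filter(function (item) { return item; }); // drop the trailing undefined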
Yury Tarabanko
  • @xtofl that's a C# function, not a javascript function. – John Dvorak Dec 29 '12 at 13:11
  • Wow! That function works really well! I hadn't considered a non-RegEx solution because regex gives me easy controls to so many other aspects of the text. I am tempted to accept this as the answer, but I would like to know if it can be done purely through RegEx – Conversation Company Dec 29 '12 at 13:16
  • @Con, it can't be done purely in regular expressions, but most languages (JavaScript is an exception) allow you to set the starting location for a search. You could do it with substrings but it'd be super messy and super slow. The above probably runs a bit faster. – Brigand Dec 29 '12 at 13:30
2

Here's a non-regex solution (it's not really a regular problem).

function pairs(str) {
  var parts = str.split(" "), out = [];
  for (var i=0; i < parts.length - 1; i++) 
    out.push([parts[i], parts[i+1]].join(' '));
  return out;
}

Pass your string and you get an array back.

demo


Side note: if you're worried about non-words in your input (making a case for regular expressions!) you can run tests on parts[i] and parts[i+1] inside the for loop. If the tests fail: don't push them onto out.
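The side note above could look roughly like this. It is only a sketch under assumptions: a "word" is taken to be letters only (the question's [A-Za-z]+), the input is split on any run of whitespace, and the names wordPairs and isWord are invented for the example.

```javascript
function wordPairs(str) {
    // Assumption: a "word" consists of letters only, as in the question's regex.
    // No g flag here, so test() keeps no state between calls.
    var isWord = /^[A-Za-z]+$/;
    var parts = str.split(/\s+/), out = [];
    for (var i = 0; i < parts.length - 1; i++) {
        // Only push the pair when both neighbours pass the word test
        if (isWord.test(parts[i]) && isWord.test(parts[i + 1])) {
            out.push(parts[i] + " " + parts[i + 1]);
        }
    }
    return out;
}

console.log(wordPairs("Mary had 2 little lambs")); // ["Mary had", "little lambs"]
```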

Brigand
  • 1
    Unless the constraint of this problem is to solve it solely with square pegs and round holes :), this is the best answer. – full.stack.ex Dec 29 '12 at 19:34
1

Another way you might like is this one:

var s = "Mary had a little lamb";

// Break on each word and loop
s.match(/\w+/g).map(function(w) {

    // Get the word (anchored at a word boundary so it can't match inside
    // a longer word), a space and another word
    return s.match(new RegExp('\\b' + w + '\\s\\w+'));

// At this point, there is one "null" value (the last word), so filter it out
}).filter(Boolean)

// There, we have an array of matches -- we want the matched value, i.e. the first element
.map(Array.prototype.shift.call.bind(Array.prototype.shift));

If you run this in your console, you'll see ["Mary had", "had a", "a little", "little lamb"].

This way, you keep working with your regex and can do the other things you want with it, although it needs some code around it to really work.

By the way, this code is not cross-browser. The following functions are not supported in IE8 and below:

  • Array.prototype.filter
  • Array.prototype.map
  • Function.prototype.bind

But they're easily shimmable, or the same functionality is easily achieved with a for loop.

Florian Margaine
0

Here we go:

The key is understanding how the regular expression's internal pointer really works, so I will explain it with a little example:

Take Mary had a little lamb with this regex: /[A-Za-z]+\s[A-Za-z]+/g

Here, the first part of the regex, [A-Za-z]+, will match Mary, so the pointer will be at the end of the y.

Mary had a little lamb
    ^

The next part (\s[A-Za-z]+) will match a space followed by another word, so...

Mary had a little lamb
        ^

The pointer will be where the word had ends. So here's your problem: you are advancing the internal pointer of the regular expression past the second word without wanting to. How is this solved? Lookaround is your friend. With lookarounds (lookahead and lookbehind) you are able to inspect your text without advancing the main internal pointer of the regular expression (it uses another pointer for that).

So at the end, the regular expression that would match what you want would be: ([A-Za-z]+(?=\s[A-Za-z]+))

Explanation:

The only thing you may not know about that regular expression is the (?=\s[A-Za-z]+) part. It means that the [A-Za-z]+ must be followed by a space and another word, otherwise the regular expression won't match. This is exactly what you seem to want, because the internal pointer will not be advanced past the second word, and it will match every word but the last one, because the last one isn't followed by a word.
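To see the effect of the look-ahead described above, here is a small sketch (the variable names are made up for the example). It shows that the look-ahead regex matches each word except the last, and that after the first match the pointer (exposed in JavaScript as lastIndex) sits right after Mary rather than after had.

```javascript
var sample = "Mary had a little lamb";

// Every word that is followed by a space and another word
var firstWords = sample.match(/[A-Za-z]+(?=\s[A-Za-z]+)/g);
console.log(firstWords); // ["Mary", "had", "a", "little"]

// The look-ahead does not move the pointer: after matching "Mary",
// lastIndex is 4 (just past the y), not 8 (past "had").
var lookahead = /[A-Za-z]+(?=\s[A-Za-z]+)/g;
lookahead.exec(sample);
console.log(lookahead.lastIndex); // 4
```

Note that the match itself only contains the first word. To get the pair, the second word still has to be captured inside the look-ahead or looked up separately.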

Then, once you have that, you only have to adapt what you are doing right now.

Here you have a working example, DEMO

Javier Diaz
0

In full admiration of the concept of 'look-ahead', I still propose a pairwise function (demo), since a regex's job is really to tokenize a character stream, and the decision of what to do with the tokens is up to the business logic. At least, that's my opinion.

It's a shame that JavaScript doesn't have a pairwise function yet, but this could do it:

function pairwise(a, f) {
  for (var i = 0; i < a.length - 1; i++) {
     f(a[i], a[i + 1]);
  }
}

var str = "Mary had a little lamb";

pairwise(str.match(/\w+/g), function(a, b) {
  document.write("<br>"+a+" "+b);
});
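Since document.write only works in a browser page, here is a hedged variant of the same idea that collects the pairs into an array instead. The pairwise function is repeated so the snippet runs on its own; the name collected is made up, and the sketch assumes the input contains at least one word (match returns null otherwise).

```javascript
// Same helper as above, repeated so this snippet is self-contained
function pairwise(a, f) {
    for (var i = 0; i < a.length - 1; i++) {
        f(a[i], a[i + 1]);
    }
}

var collected = [];

// Tokenize with the regex, then let the callback decide what to do with each pair
pairwise("Mary had a little lamb".match(/\w+/g), function (a, b) {
    collected.push(a + " " + b);
});

console.log(collected); // ["Mary had", "had a", "a little", "little lamb"]
```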

xtofl