0

I'm looking for a regexp that can match words n at a time. Let's say n := 2; then the input

Lorem ipsum dolor sit amet, consectetur adipiscing elit

would yield:

Lorem ipsum, ipsum dolor, dolor sit, sit amet (notice the comma here), consectetur adipiscing, adipiscing elit.

I have tried using \b for word boundaries, to no avail. I am really lost trying to find a regex capable of giving me n words: /\b(\w+)\b(\w+)\b/i doesn't cut it, and neither did the other combinations I tried.

Jo Colina
  • Possible duplicate of [Learning Regular Expressions](http://stackoverflow.com/questions/4736/learning-regular-expressions) – Biffen Nov 13 '16 at 10:10
  • @Biffen how is it a duplicate of that question? – Jo Colina Nov 13 '16 at 10:13
  • This is basically a *give-me-a-regex* ‘question’. They're all duplicates (in a way) of that one. – Biffen Nov 13 '16 at 10:14
  • @Biffen, even though I really like your philosophy, I am really lost trying to find a regex capable of giving me n words... `/\b(\w+)\b(\w+)\b/i` can't cut it, and even tried multiple combinations. – Jo Colina Nov 13 '16 at 10:19
  • You need overlapping matches and `\W+` between words. Check https://jsfiddle.net/ncxucvfk/ – Wiktor Stribiżew Nov 13 '16 at 10:20
  • @JoColina I don't think you've quite grasped how `\b` works: `(\w+)\b(\w+)` can't ever match anything, since there is never, by definition, a word boundary (`\b`) between two word characters (`\w`). You're going to have to take non-word characters like whitespace and punctuation into account. – Biffen Nov 13 '16 at 10:21
  • @Biffen ok, see it now, \W+ is the trick, however @Wiktor, I'm getting `amet, consectetur`. Gonna pop them from the array though! Thanks a lot – Jo Colina Nov 13 '16 at 10:26
  • I did not pay attention: so, the only words you need are separated with whitespace? Then, you need `\s+`, not `\W+` between. – Wiktor Stribiżew Nov 13 '16 at 10:31
  • I'm puzzled by why you seem to think regexp is relevant here, other than possibly to break the sentence into words. Once you have words, it's a simple affair to create the "n-grams" (which is what your n-word groups are called). –  Nov 13 '16 at 12:17
  • @Biffen No, it's a *give-me-a-regexp-even-though-that's-not-what-I-really-need* question. –  Nov 13 '16 at 12:18

3 Answers

0

A word boundary \b does not consume any characters: it is a zero-width assertion. It only asserts a position between a word char and a non-word char, between the start of the string and a word char, or between a word char and the end of the string.
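
To make the zero-width point concrete, here is a small sketch (the variable names are only illustrative) showing that the pattern from the question can never match, while the same pattern with \s+ between the two groups does:

```javascript
const sample = "Lorem ipsum dolor";

// \b consumes nothing, so (\w+)\b(\w+) demands a boundary between two
// adjacent word characters, which cannot exist.
const broken = /\b(\w+)\b(\w+)\b/i;
console.log(broken.test(sample)); // false

// Consuming the whitespace explicitly lets two words match.
const fixed = /\b(\w+)\s+(\w+)\b/i;
const pair = sample.match(fixed);
console.log(pair[1], pair[2]); // Lorem ipsum

// A match of \b on its own is an empty string.
console.log(sample.match(/\b/)[0].length); // 0
```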

You need to use \s+ to consume the whitespace between words, and capture inside a positive lookahead to get overlapping matches:

var n = 2;
var s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";
var re = new RegExp("(?=(\\b\\w+(?:\\s+\\w+){" + (n-1) + "}\\b))", "g");
var res = [], m;
while ((m = re.exec(s)) !== null) { // Iterating through matches
  if (m.index === re.lastIndex) {   // This is necessary to avoid
    re.lastIndex++;                 // infinite loops with
  }                                 // zero-width matches
  res.push(m[1]);                   // Collecting the results (group 1 values)
}
console.log(res);

The pattern has to be built dynamically because n is a variable, so you need the RegExp constructor notation. For n = 2 it will look like

/(?=(\b\w+(?:\s+\w+){1}\b))/g

And it will find all locations in the string that are followed by this sequence:

  • \b - a word boundary
  • \w+ - 1 or more word chars
  • (?:\s+\w+){n-1} - n-1 sequences (here, 1) of:
    • \s+ - 1 or more whitespaces
    • \w+ - 1 or more word chars
  • \b - a trailing word boundary
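
As a possible variation on the loop above, the same pattern can be wrapped in a small helper that relies on String.prototype.matchAll, which advances past zero-width matches on its own. This is only a sketch: the name wordGroups and the argument check are assumptions for illustration, not part of the original answer.

```javascript
// Hypothetical helper: returns every run of n whitespace-separated words,
// including overlapping runs.
function wordGroups(text, n) {
  // n is interpolated into the pattern, so insist on a positive integer.
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  const re = new RegExp("(?=(\\b\\w+(?:\\s+\\w+){" + (n - 1) + "}\\b))", "g");
  // matchAll steps past empty matches itself, so no manual lastIndex bump.
  return Array.from(text.matchAll(re), (m) => m[1]);
}

console.log(wordGroups("Lorem ipsum dolor sit amet, consectetur adipiscing elit", 2));
// [ 'Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet',
//   'consectetur adipiscing', 'adipiscing elit' ]
```

Because the words must be separated by whitespace only, the pair that straddles the comma ("amet, consectetur") is skipped, which is what the question asked for.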
Wiktor Stribiżew
0

Regular expressions are not really what you need here, other than to split the input into words. The problem involves matching overlapping substrings, which regexps are not very good at, especially in the JavaScript flavor. Instead, simply break the input into words, and a quick piece of JavaScript will generate the "n-grams" (which is the correct term for your n-word groups).

const input = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";

// From an array of words, generate n-grams.
function ngrams(words, n) {
  const results = [];

  for (let i = 0; i < words.length - n + 1; i++) 
    results.push(words.slice(i, i + n));

  return results;
}

// Split the input into bare words (no trailing space or comma), then build the 2-grams.
console.log(ngrams(input.match(/\w+/g), 2));
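
The question's expected output leaves out the pair that straddles the comma. One way to get that with the split-then-slice approach is to cut the input into punctuation-free segments first and build the n-grams inside each segment. The sketch below assumes that any character other than a word character or whitespace ends a segment; the name ngramsBySegment is made up for illustration.

```javascript
// Hypothetical helper: n-grams (joined with a space) that never span punctuation.
function ngramsBySegment(text, n) {
  const results = [];
  // Each segment is a run of word characters and whitespace only.
  for (const segment of text.match(/[\w\s]+/g) || []) {
    // A segment made only of whitespace has no words, hence the fallback.
    const words = segment.match(/\w+/g) || [];
    for (let i = 0; i + n <= words.length; i++) {
      results.push(words.slice(i, i + n).join(" "));
    }
  }
  return results;
}

console.log(ngramsBySegment("Lorem ipsum dolor sit amet, consectetur adipiscing elit", 2));
// [ 'Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet',
//   'consectetur adipiscing', 'adipiscing elit' ]
```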
-1

Not a pure regex solution, but it works and is easy to read and understand:

let input = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit';
let matches = input.match(/(\w+,? \w+)/g)
    .map(str => str.replace(',', ''));

console.log(matches) // ['Lorem ipsum', 'dolor sit', 'amet consectetur', 'adipiscing elit']

Warning: the no-match case is not handled. match() then returns null and the .map() call throws.
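
A minimal sketch of how the null case could be guarded, using the same pattern as the snippet above (the wrapper name adjacentPairs is hypothetical):

```javascript
// Hypothetical wrapper around the same pattern, with the null case handled.
function adjacentPairs(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return (text.match(/\w+,? \w+/g) || []).map((str) => str.replace(",", ""));
}

console.log(adjacentPairs("Lorem ipsum dolor sit amet, consectetur adipiscing elit"));
// [ 'Lorem ipsum', 'dolor sit', 'amet consectetur', 'adipiscing elit' ]
console.log(adjacentPairs("Lorem")); // []
```

Note that match() with the g flag returns non-overlapping matches, so pairs such as "ipsum dolor" do not appear in the result.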

jhenninger