4

This example finds only sam. How can I make it find both sam and samwise?

var regex = /sam|samwise|merry|pippin/g;
var string = 'samwise gamgee';
var match = string.match(regex);
console.log(match);

Note: this is a simple example, but my real regexes are created by joining 500 keywords at a time, so it's too cumbersome to search for all overlapping keywords and make a special case for each with something like /sam(wise)/. The other obvious solution I can think of is to iterate through all keywords individually, but I think there must be a fast and elegant single-regex solution.

Alexander Vasenin

5 Answers

2

You can use a lookahead regex with capturing groups for this overlapping match:

var regex = /(?=(sam))(?=(samwise))/;
var string = 'samwise';
var match = string.match( regex ).filter(Boolean);
//=> ["sam", "samwise"]
  • It is important not to use the g (global) flag in the regex.
  • filter(Boolean) removes the empty first element from the match array.
anubhava
  • This works for the case I wrote first, but unfortunately it returns `null` if the string is `sam`. Moreover, if I add `(?=(merry))` to the regex, it returns `null` for *every* possible string. – Alexander Vasenin Jul 18 '15 at 09:50
  • This requires that all keywords match simultaneously at the same offset, i.e. all keywords must be prefixes of each other. – melpomene Jul 18 '15 at 09:53
  • @AlexanderVasenin: For `samwise gamgee` input you still want to match `samwise` and `sam` only? – anubhava Jul 18 '15 at 10:03
  • String could be anything, if it's `merry and samwise are friends` I want it to match ["sam", "samwise", "merry"]. – Alexander Vasenin Jul 18 '15 at 10:10
  • 1
    In that case `(?=.*?(merry))(?=.*?(sam))(?=.*?(samwise))` will work for you – anubhava Jul 18 '15 at 10:12
  • 2
    Sorry, but this version doesn't work with `merry` as input. It turns out this is a really hard case for a regex. I feel like Truman hitting the wall of his world with a boat. – Alexander Vasenin Jul 18 '15 at 10:26
  • Hmm, JS regex doesn't have the fancy features of PCRE. I guess you will need to loop over the keywords and check each one against the input text. – anubhava Jul 18 '15 at 10:36
  • @anubhava Ironically, the first book I consulted was the iconic [Programming Perl](http://www.amazon.com/Programming-Perl-Unmatched-processing-scripting-ebook/dp/B007S291SA/) and I didn't find an answer there – Alexander Vasenin Jul 18 '15 at 11:28
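
The failure modes raised in the comments above can be reproduced with a small sketch. The variable names are illustrative only. Chained lookaheads all have to succeed at the same offset, so the pattern matches only where every keyword is present at once.

```javascript
// Both lookaheads are tested at the same offset, so the pattern
// only matches where "sam" AND "samwise" start together.
var chained = /(?=(sam))(?=(samwise))/;

var onSamwise = chained.exec('samwise'); // matches at offset 0
var onSam = chained.exec('sam');         // null: "samwise" is not present

// A keyword with a different first letter makes the pattern unsatisfiable:
// no offset can start with both "sam" and "merry".
var withMerry = /(?=(sam))(?=(samwise))(?=(merry))/;
var onFriends = withMerry.exec('merry and samwise are friends'); // null

// The .*? variant relaxes the offset, but still requires ALL keywords
// to occur somewhere in the string.
var relaxed = /(?=.*?(merry))(?=.*?(sam))(?=.*?(samwise))/;
var onMerry = relaxed.exec('merry'); // null: "sam" never occurs

console.log(onSamwise && onSamwise.slice(1)); // ["sam", "samwise"]
console.log(onSam, onFriends, onMerry);       // null null null
```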
1

Why not just map indexOf() over the array substr:

var string = 'samwise gamgee';
var substr = ['sam', 'samwise', 'merry', 'pippin'];

var matches = substr.map(function(m) {
  return (string.indexOf(m) < 0 ? false : m);
}).filter(Boolean);

console.log(matches);

Array [ "sam", "samwise" ]

This probably performs better than a regex. But if you need regex functionality (e.g. case-insensitive matching, word boundaries, the matched text itself), use the exec method:

var matches = substr.map(function(v) {
  // Case-insensitive match anchored at a leading word boundary
  var re = new RegExp("\\b" + v, "i");
  var m = re.exec(string);
  return (m !== null ? m[0] : false);
}).filter(Boolean);

This version, with the i flag (ignore case), returns the first match for each keyword, anchored at an initial \b word boundary.
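
If the keywords may contain regex metacharacters (an assumption; the sample keywords above do not), each one would need escaping before it is passed to new RegExp. A minimal sketch of the per-keyword approach with escaping follows. The helpers escapeRegExp and findKeywords are hypothetical names introduced here, and unlike the exec version this sketch returns the keyword itself, not the matched text.

```javascript
// Hypothetical helper: escape regex metacharacters in a keyword.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return every keyword that occurs in `text`
// at a word boundary, ignoring case, in keyword order.
function findKeywords(text, keywords) {
  return keywords.filter(function (k) {
    return new RegExp('\\b' + escapeRegExp(k), 'i').test(text);
  });
}

var found = findKeywords('Merry and Samwise are friends',
                         ['sam', 'samwise', 'merry', 'pippin']);
console.log(found); // ["sam", "samwise", "merry"]
```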

Jonny 5
0

I can't think of a simple and elegant solution, but I've got something that uses a single regex:

function quotemeta(s) {
    return s.replace(/\W/g, '\\$&');
}

let keywords = ['samwise', 'sam'];

let subsumed_by = {};
keywords.sort();
for (let i = keywords.length; i--; ) {
    let k = keywords[i];
    // Check every earlier keyword: prefixes of k need not be adjacent after sorting
    for (let j = i - 1; j >= 0; j--) {
        if (k.startsWith(keywords[j])) {
            (subsumed_by[k] = subsumed_by[k] || []).push(keywords[j]);
        }
    }
}

// Longest first, so the alternation picks the longest keyword at each offset
keywords.sort(function (a, b) { return b.length - a.length; });
let re = new RegExp('(?=(' + keywords.map(quotemeta).join('|') + '))[\\s\\S]', 'g');

let string = 'samwise samgee';

let result = [];
let m;
while (m = re.exec(string)) {
    result.push(m[1]);
    result.push.apply(result, subsumed_by[m[1]] || []);
}

console.log(result);
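
The same idea can be wrapped in a reusable function. The sketch below assumes an ES2020 environment, where String.prototype.matchAll is available. The names escapeRegExp and findOverlapping are hypothetical. It relies on the observation that every keyword matching at a given offset is a prefix of the longest keyword matching there, and it scans the keyword list on each match instead of precomputing the prefix table.

```javascript
// Hypothetical helper: escape regex metacharacters in a keyword.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: list every keyword occurrence, overlapping or not.
function findOverlapping(text, keywords) {
  // Longest first, so the alternation captures the longest keyword at each offset.
  var sorted = keywords.slice().sort(function (a, b) {
    return b.length - a.length;
  });
  // The zero-width lookahead lets matchAll visit every offset in turn.
  var re = new RegExp('(?=(' + sorted.map(escapeRegExp).join('|') + '))', 'g');
  var result = [];
  for (var m of text.matchAll(re)) {
    // Report the captured keyword and every shorter keyword that is a prefix of it.
    sorted.forEach(function (k) {
      if (m[1].startsWith(k)) result.push(k);
    });
  }
  return result;
}

var names = ['sam', 'samwise', 'merry', 'pippin'];
console.log(findOverlapping('merry and samwise are friends', names));
// ["merry", "samwise", "sam"]
```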
melpomene
0

How about:

var re = /((sam)(?:wise)?)/;
var m = 'samwise'.match(re); // gives ["samwise", "samwise", "sam"]
var m = 'sam'.match(re);     // gives ["sam", "sam", "sam"]

You can use one of the techniques from "Unique values in an array" to remove duplicates.
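
One way to drop the duplicates, assuming an ES2015 environment where Set is available, is sketched below. The variable names are illustrative.

```javascript
var re = /((sam)(?:wise)?)/;

// A Set keeps only the first occurrence of each value, in insertion order.
var uniqueSamwise = Array.from(new Set('samwise'.match(re)));
var uniqueSam = Array.from(new Set('sam'.match(re)));

console.log(uniqueSamwise); // ["samwise", "sam"]
console.log(uniqueSam);     // ["sam"]
```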

Toto
  • Sorry, you're just trying to implement a special case for overlapping keywords. The idea is to handle them all equally, overlapping or not. – Alexander Vasenin Jul 18 '15 at 10:14
0

If you don't want to create special cases, and if order doesn't matter, why not first match only full names with:

\b(sam|samwise|merry|pippin)\b

and then check whether any shorter keyword is contained in a longer word, for example with:

(sam|samwise|merry|pippin)(?=\w+\b)

It is not a single elegant regex, but I suppose it is simpler than iterating through all matches.
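
Put together, the two passes might look like the sketch below. The variable names are illustrative. Note that the second pass reports only the first alternative that matches at each offset, so it is not a general solution when several short keywords share a prefix.

```javascript
var text = 'merry and samwise are friends';

// Pass 1: keywords that appear as whole words.
var whole = text.match(/\b(sam|samwise|merry|pippin)\b/g) || [];

// Pass 2: keywords that are followed by more word characters,
// i.e. shorter keywords sitting inside a longer word.
var inside = text.match(/(sam|samwise|merry|pippin)(?=\w+\b)/g) || [];

var all = whole.concat(inside);
console.log(all); // ["merry", "samwise", "sam"]
```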

m.cekiera