To combine one or more regex patterns in JavaScript I'm using the following function:
Tokenizer.prototype.combinePatterns = function() {
return new RegExp('(' + [].slice.call(arguments).map(function (e) {
var e = e.toString()
return '(?:' + e.substring(1, e.length - 1) + ')'
}).join('|') + ')', "gi")
};
This works fine. Now I want to "protect" some patterns, i.e. keep the text they match intact when the resulting regex is executed. In other words, the default_pattern
should not be applied to any text matched by the patterns defined in the protected_patterns
array (this concept is taken from the protected patterns option of the MOSES tokenizer).
These protected patterns may or may not overlap with the default pattern:
AggressiveTokenizer.prototype.tokenize = function(text, params = {}) {
var options = {
default_pattern: /[^a-z0-9äâàéèëêïîöôùüûœç]+/,
protected_patterns: []
};
for (var attr in params) options[attr] = params[attr];
var patterns = [].concat(options.protected_patterns).concat(options.default_pattern);
// LP: pass along all regex patterns as argument
patterns = this.combinePatterns.apply(this,patterns);
// break a string up into an array of tokens by anything non-word
return this.trim(text.split(patterns));
};
Following this approach, and assuming I want to protect a pattern like
[ '\bla([- ]?la)+\b']
I get this combined regex from the combinePatterns
method:
/((?:la([- ]?la)+)|(?:[^a-z0-9äâàéèëêïîöôùüûœç]+))/gi
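One thing worth checking at this point, as a minimal sketch: inside a quoted JavaScript string, \b is the backspace character (char code 8), not a regex word boundary. That would explain why the combined regex above contains no \b: the substring(1, length - 1) call removes the two backspace characters from the string, whereas for a real RegExp it removes the / delimiters. The names stripEnds, asString and asRegExp are illustrative only and not part of the original code.

```javascript
// In a string literal, \b is a backspace (char code 8), not a word boundary.
var asString = '\bla([- ]?la)+\b';
var asRegExp = /\bla([- ]?la)+\b/;

// Same stripping logic as combinePatterns: drop the first and last character.
function stripEnds(e) {
  var s = e.toString();
  return s.substring(1, s.length - 1);
}

var firstCharCode = asString.charCodeAt(0); // 8, i.e. backspace
var fromString = stripEnds(asString);       // 'la([- ]?la)+' (boundaries lost)
var fromRegExp = stripEnds(asRegExp);       // '\\bla([- ]?la)+\\b' (boundaries kept)

console.log(firstCharCode, fromString, fromRegExp);
```

Passing the protected pattern as a RegExp literal, or doubling the backslashes in the string, would keep the word boundaries.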
The result is not as expected. For example, for the (French) text salut comment allez-vous la-la-la
, I do get the desired la-la-la
token as a whole, but I am also getting undefined
tokens, and a la-
token as well:
var combinePatterns = function() {
return new RegExp('(' + [].slice.call(arguments).map(function(e) {
var e = e.toString()
return '(?:' + e.substring(1, e.length - 1) + ')'
}).join('|') + ')', "gi")
};
var tokenize = function(text, params = {}) {
var options = {
default_pattern: /[^a-z0-9äâàéèëêïîöôùüûœç]+/,
protected_patterns: []
};
for (var attr in params) options[attr] = params[attr];
var patterns = [].concat(options.protected_patterns).concat(options.default_pattern);
// LP: pass along all regex patterns as argument
patterns = this.combinePatterns.apply(this, patterns);
// break a string up into an array of tokens by anything non-word
return text.trim().split(patterns);
}
var text = "salut comment allez-vous la-la-la";
var res = tokenize(text, {
protected_patterns: ['\bla([- ]?la)+\b']
})
console.log(res)
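The undefined entries can be reproduced in isolation. The sketch below assumes only the documented behavior of String.prototype.split: when the separator regex has capturing groups, the value of every group is inserted into the result after each separator, and a group that did not take part in the match contributes undefined. The shortened character class and the variable names are illustrative.

```javascript
// split() inserts the value of every capturing group after each separator.
// A group that did not take part in the match contributes undefined.
var withInnerGroup = /((?:la([- ]?la)+)|(?:[^a-z0-9]+))/gi;
var withoutGroups = /(?:la(?:[- ]?la)+)|(?:[^a-z0-9]+)/gi;

var sample = "allez-vous";

// "-" comes from the outer group, undefined from the unmatched inner group.
var a = sample.split(withInnerGroup); // ["allez", "-", undefined, "vous"]

// With no capturing groups, only the text between separators is returned.
var b = sample.split(withoutGroups);  // ["allez", "vous"]

console.log(a, b);
```

Note that removing every capturing group also drops the protected token itself, because split then treats a protected match as just another separator.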
My expected result is
[
"salut",
"comment",
"allez",
"vous",
"la-la-la"
]
What is wrong: the approach used to combine the protected patterns, or the regex in the protected_patterns
array?
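For comparison, here is a rough sketch of a different approach that matches tokens instead of splitting on separators. It is only an assumption about what might fit, not the Tokenizer's actual method: tokenizeProtected is a hypothetical helper, the protected patterns are passed as RegExp objects so that \b stays a word boundary, and the fallback is the positive form of the default character class. Keep in mind that \b in JavaScript only treats ASCII letters, digits and underscore as word characters, so protected patterns next to accented letters may behave differently.

```javascript
// Hypothetical helper, not part of the original Tokenizer code.
// Protected patterns are tried first; anything else falls back to word runs.
function tokenizeProtected(text, protectedPatterns) {
  var word = '[a-z0-9äâàéèëêïîöôùüûœç]+';
  var sources = protectedPatterns
    .map(function (p) { return '(?:' + p.source + ')'; })
    .concat(word);
  var combined = new RegExp(sources.join('|'), 'gi');
  // match() with the g flag returns all matches, or null if there are none.
  return text.match(combined) || [];
}

var tokens = tokenizeProtected(
  "salut comment allez-vous la-la-la",
  [/\bla(?:[- ]?la)+\b/]
);
console.log(tokens); // ["salut", "comment", "allez", "vous", "la-la-la"]
```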
Tip:
I have noticed that combinePatterns,
when applied only to the default_pattern,
generates this regex
return this.trim(text.split(/((?:[^a-z0-9äâàéèëêïîöôùüûœç]+))/gi));
which produces slightly different tokens than the plain default pattern:
return this.trim(text.split(/[^a-z0-9äâàéèëêïîöôùüûœç]+/i));
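The difference between the two calls can be seen on a short input. This sketch leaves out the trim step and uses an illustrative sample string; the only change between the two regexes that affects the output is the outer capturing group that combinePatterns adds, which makes split keep each separator as an element of its own.

```javascript
var sample = "allez-vous";

// Wrapped in a capturing group: the separator "-" is kept in the result.
var wrapped = sample.split(/((?:[^a-z0-9äâàéèëêïîöôùüûœç]+))/gi);

// Plain default pattern: the separator is discarded.
var plain = sample.split(/[^a-z0-9äâàéèëêïîöôùüûœç]+/i);

console.log(wrapped); // ["allez", "-", "vous"]
console.log(plain);   // ["allez", "vous"]
```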