I have a tokenizer function that takes a string, a regex pattern to split on,
and an arbitrary list of regex patterns to be protected from tokenization. To achieve that, I'm using the placeholder ____SSS____
to keep those patterns from being split:
function tokenize(str, default_pattern, protected_patterns) {
  // Build one alternation that matches any of the protected patterns.
  const screen = new RegExp('(?:' + protected_patterns.map(s => '(?:' + s + ')').join('|') + ')', 'gi');
  const screened = [];
  // Replace each protected match with an indexed placeholder; the placeholder
  // is made of non-separator characters, so it doesn't get split.
  str = str.replace(screen, s => {
    const i = screened.push(s) - 1;
    return '____SSS____' + i + '____SSS____';
  });
  // Split, then restore the protected matches from the placeholders.
  const res = str.split(default_pattern).map(s =>
    s.replace(/____SSS____(\d+)____SSS____/g, (_, i) => screened[i])
  );
  return res;
}
For example, if I want to prevent the pattern yo-ho
from being split, I do:
tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/i, ["\\byo-ho\\b"])
(8) ["Podia", "ser", "yo-ho", "mi", "amor", "ahora", "ya", "acabó"]
Of course, I have to add the placeholder format ____SSS____(\d+)____SSS____
to the regex character class; otherwise the placeholder itself gets split:
patterns("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/i, ["\\byo-ho\\b"])
(9) ["Podia", "ser", "SSS", "SSS", "mi", "amor", "ahora", "ya", "acabó"]
Now, for different languages I may have different split rules like
{
  "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
}
and I would like to dynamically add ____SSS____(\d+)____SSS____
to each of them, but I can't find the right way to do this. The result should look like:
{
  "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/,
  "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç____SSS____(\d+)____SSS____]+/i
}
so that the tokenizer with protected patterns works properly.
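A sketch of the kind of transformation I'm after, assuming every rule is a single negated character class of the form /[^…]+/flags (protectRule and PLACEHOLDER are illustrative names, not part of my code), would rewrite each regex's source and rebuild it with its original flags:

const PLACEHOLDER = '____SSS____(\\d+)____SSS____';

// Rebuild a rule of the form /[^...]+/flags with the placeholder
// characters inserted just before the closing bracket of the class.
function protectRule(re) {
  const source = re.source.replace(/\]\+$/, PLACEHOLDER + ']+');
  return new RegExp(source, re.flags);
}

const rules = {
  "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
};

const protectedRules = {};
for (const lang of Object.keys(rules)) {
  protectedRules[lang] = protectRule(rules[lang]);
}
// protectedRules.es -> /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/
// protectedRules.fr -> /[^a-z0-9äâàéèëêïîöôùüûœç____SSS____(\d+)____SSS____]+/i

Note that inside a character class, (\d+) is read as the literal characters (, ) and + plus the digit class \d; since the class is negated, all the placeholder characters become non-separators, which is exactly what keeps the placeholders intact through the split.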