I have a tokenizer function that takes a string, a regex pattern to split on, and an arbitrary list of regex patterns to protect from tokenization. To achieve that, I replace each protected match with a placeholder built around ____SSS____ so that it does not get split:

function tokenize(str, default_pattern, protected_patterns) {
    // One alternation of all protected patterns, each wrapped in a non-capturing group.
    const screen = new RegExp('(?:' + protected_patterns.map(s => '(?:' + s + ')').join('|') + ')', 'gi');
    const screened = [];
    // Replace every protected match with an indexed placeholder.
    str = str.replace(screen, s => {
        const i = screened.push(s) - 1;
        return '____SSS____' + i + '____SSS____'; // built from non-separator characters, so these placeholders don't get split
    });
    // Split, then restore the protected text inside each token.
    const res = str.split(default_pattern).map(s => s.replace(/____SSS____(\d+)____SSS____/g, (_, i) => screened[i]));
    return res;
}

For example, to prevent the pattern yo-ho from being split, I call:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/i, ["\\byo-ho\\b"])
(8) ["Podia", "ser", "yo-ho", "mi", "amor", "ahora", "ya", "acabó"]

Of course, I have to add the placeholder format ____SSS____(\d+)____SSS____ to the regex; otherwise the placeholder itself gets split:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/i, ["\\byo-ho\\b"])
(9) ["Podia", "ser", "SSS", "SSS", "mi", "amor", "ahora", "ya", "acabó"]

Now, for different languages I may have different split rules, like:

{
    "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
    "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
}

and I would like to add ____SSS____(\d+)____SSS____ to each of them dynamically, but I can't find the right way to do it. The result should look like:

{
    "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/,
    "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç____SSS____(\d+)____SSS____]+/i
}

so that the tokenizer works properly with protected patterns.

loretoparisi

1 Answer

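You can simply capture the existing split rule like this:
(.+)(\].*)
and insert your placeholder between the first and second capture groups. Because (.+) is greedy, the first group takes everything up to the last ], so the placeholder lands at the end of the character class.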

https://regex101.com/r/QCFnLS/1
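
The capture-group approach above could be applied in JavaScript roughly as follows. This is only a sketch: it assumes each split rule is a RegExp object whose source ends with a character class, and the names PLACEHOLDER, addPlaceholder, rules and patched are made up for illustration.

```javascript
// Text to insert. Inside a character class it is read as individual
// literal characters (_, S, parentheses, +) plus the digit class \d.
const PLACEHOLDER = '____SSS____(\\d+)____SSS____';

// Hypothetical helper: insert PLACEHOLDER before the last "]" of a split rule.
function addPlaceholder(rule) {
  // Apply the pattern (.+)(\].*) to the rule's source text:
  // head = everything before the last "]", tail = that "]" and whatever follows.
  const source = rule.source.replace(/(.+)(\].*)/, (_, head, tail) => head + PLACEHOLDER + tail);
  // Rebuild the regex and keep the original flags (for example "i").
  return new RegExp(source, rule.flags);
}

const rules = {
  es: /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  fr: /[^a-z0-9äâàéèëêïîöôùüûœç]+/i,
};

// Patch every language rule in one pass.
const patched = Object.fromEntries(
  Object.entries(rules).map(([lang, rule]) => [lang, addPlaceholder(rule)])
);
```

A patched rule such as patched.es can then be passed to tokenize as default_pattern. One side effect worth checking: because the inserted text sits inside a character class, \d makes every digit a non-separator, so for a rule like "es" numbers in ordinary text would no longer be split away.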

Tytrox