Find and change a Cyrillic word with word boundaries in Google Apps Script

Question

The problem is that \b doesn't work with Russian and Ukrainian letters.

Here I try to find all matches of the word 'февраля' in the text, replace them with a temporary word, make that a link, and then change it back to 'февраля'.

function addLinks(word, siteurl) {
  var id = 'doc\'s ID';
  var doc = DocumentApp.openById(id);
  var body = doc.getBody();
  var tempword = 'ASDFDSGDDKDSL2';
  var searchText = "\\b"+word+"\\b";
  var element = body.findText(searchText);
  console.log(element);
  while (element) {
    var start = element.getStartOffset();
    var text = element.getElement().asText();
    text.replaceText(searchText, tempword);
    text.setLinkUrl(start, start + tempword.length - 1, siteurl);
    element = body.findText(searchText);
  }
  body.replaceText(tempword, word);
}

addLinks('февраля', 'example.com');

It works as it should if I change the Russian word 'февраля' to the English 'february':

addLinks('february', 'example.com');

I need a regular expression because if I just search for 'февраля', the script also matches other words such as 'февралям', 'февралями', etc. So the question is how to make it work. The error "Exception: Invalid regular expression pattern" occurs with this code:

var searchText = "(?<=[\\s,.:;\"']|^)"+word+"(?=[\\s,.:;\"']|$)";

or this:

var searchText = "(^|\s)"+word+"(?=\s|$)";

and some other.
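
For reference, both symptoms can be reproduced in plain JavaScript. This is only a sketch: it assumes the engine treats just ASCII letters, digits and underscore as word characters for `\b` (true for JavaScript without Unicode-aware flags, and assumed here to match what `findText` does), and it shows that `\s` inside an ordinary string literal loses its backslash. The variable names are illustrative.

```javascript
// \b is defined relative to \w, which covers only [A-Za-z0-9_] here,
// so there is no "boundary" between a space and a Cyrillic letter.
var latinFound = /\bfebruary\b/.test('on february 1');
var cyrillicFound = /\bфевраля\b/.test('1 февраля 2021');

// In an ordinary string literal an unknown escape such as \s drops the backslash.
var singleEscaped = "(^|\s)";   // same text as "(^|s)"
var doubleEscaped = "(^|\\s)";  // the text (^|\s)

console.log(latinFound, cyrillicFound);    // true false
console.log(singleEscaped, doubleEscaped); // (^|s) (^|\s)
```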

Here is one approach using a JavaScript regex: `(?<=[\s,.:;"']|^)февраля(?=[\s,.:;"']|$)`. Explanation [here](https://shiba1014.medium.com/regex-word-boundaries-with-unicode-207794f6e7ed). There is probably a SO question covering this already, but I did not find a good candidate. — andrewJames, Oct 09 '21 at 23:17
See https://stackoverflow.com/a/63391493/3832970. There are some other solutions, also present on SO. — Wiktor Stribiżew, Oct 09 '21 at 23:28
@andrewJames the question is about Google Apps Script. Your solution doesn't work in GAS, since GAS doesn't support lookarounds, so a workaround is needed. — Yuri Khristich, Oct 10 '21 at 00:36
I found a workaround for GAS. I will post it as soon as the question is reopened. — Yuri Khristich, Oct 10 '21 at 00:55
@YuriKhristich - Yes, agreed, if you try to use a GAS regex (e.g. inside `body.findText()`). The GAS regex syntax is indeed limited. But if you extract the body text to a variable you can use a JavaScript regex. Something like: `const text = doc.getBody().getText();` then `const found = text.match(/regex_from_my_comment/);` and then `console.log(found[0]);` - just as a _very_ basic example in a comment. You would have to replace sections of the document, this way, I think. But the regex does work. (Sounds like you may have a more elegant way). — andrewJames, Oct 10 '21 at 01:07
@andrewJames If you extract the text, you lose its formatting. Keeping the formatting of Google Docs text via pure JS is hell. I've found a relatively short and elegant solution, about 25 lines long. — Yuri Khristich, Oct 10 '21 at 01:26
@Tanaike please, help to reopen this question. I think I have the solution. — Yuri Khristich, Oct 10 '21 at 01:50
@Tanaike The example of text. Эники-беники февраля тратрата Но не нужно февралями и февралям. Также не подходят подфевраями или же ещё что-то. Нужно только февралями. — Vsevolod, Oct 10 '21 at 06:20
You can't make `"(^|\s)"` work because in a GAS string literal the backslash is stripped from unknown escape sequences. You should have used `"(^|\\s)"` / `"($|\\s)"` (from the [linked question](https://stackoverflow.com/a/10590516/3832970)), which *is* in fact the text `(^|\s)`. Besides, `(?:^|\\s)` won't work if you need to match words between non-word chars; it only matches words between whitespace. So the solution I linked to is still the right one, you just need to replace the lookarounds with consuming patterns: ``"([^\u0400-\u04ff]|^)" + tempword + "([^\u0400-\u04ff]|$)"``. — Wiktor Stribiżew, Oct 10 '21 at 09:19
@WiktorStribiżew did you try to apply your last solution to a test Google document? The problem is how to keep those makeshift 'lookarounds' from being changed, given the GAS limitations. (By the way, [A-яЁё] works fine, no need for the numeric codes.) Plain RegExp tricks don't quite work in this particular case. — Yuri Khristich, Oct 10 '21 at 09:52
@YuriKhristich To be precise, GAS does support lookarounds. Google docs class method `.find*()` doesn't. — TheMaster, Oct 10 '21 at 09:53
@YuriKhristich See [*"I think next code does what is needed... At least in this situation."*](https://stackoverflow.com/a/69513101/3832970) If that is "working", then my suggestion is a "solution". I know pretty well that `replaceText` does not allow backreferences in the substitution and that Re2 does not support lookarounds, and even inline backreferences. I know about all possible lacks of constructs in RE2 compared to JS regex. — Wiktor Stribiżew, Oct 10 '21 at 09:54
@TheMaster thank you for bringing that up. Not sure if it helps in this case, though. — Yuri Khristich, Oct 10 '21 at 10:00
@WiktorStribiżew actually the main problem in this case was not that someone provided wrong or partial solution. It's okay. Nobody can know 'all possible lacks'. Nobody is obliged to read all tags carefully, etc. The problem was that the one closed the question after that. — Yuri Khristich, Oct 10 '21 at 10:35
Deleted my [solution](https://stackoverflow.com/questions/69511092/find-and-change-cyrillic-word-with-boundary-in-google-scripts/69513101#69513101), because it has issues. — Vsevolod, Oct 10 '21 at 18:08
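
The "consuming patterns instead of lookarounds" idea from the comments can be sketched in plain JavaScript to show the offset arithmetic. This is an illustration only, not the `findText` API: `findWithConsumingBounds` is a hypothetical helper, the boundary class is the `[^\u0400-\u04ff]` range suggested above, and `word` is assumed to be non-empty and to contain no regex metacharacters. Because the boundary characters are consumed, the match has to be trimmed by the length of the leading group, and the search has to step back over the trailing boundary so that two occurrences separated by a single character are both found.

```javascript
// Hypothetical helper: whole-word search with consuming boundary groups.
// Returns [{start, end}] where end is inclusive.
function findWithConsumingBounds(text, word) {
  var bound = '[^\\u0400-\\u04ff]'; // any character outside the Cyrillic block
  var re = new RegExp('(' + bound + '|^)' + word + '(' + bound + '|$)', 'g');
  var ranges = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    // m[1] is the consumed leading boundary (empty when matched by ^).
    var start = m.index + m[1].length;
    ranges.push({start: start, end: start + word.length - 1});
    // Give the trailing boundary back so it can act as the leading
    // boundary of the next occurrence.
    re.lastIndex -= m[2].length;
  }
  return ranges;
}

console.log(findWithConsumingBounds('февраля февраля', 'февраля'));
console.log(findWithConsumingBounds('февралями', 'февраля'));
```

Without the `lastIndex` adjustment, the second word in `'февраля февраля'` would be skipped, since the space between the two words would already have been consumed by the first match.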

Yuri Khristich · Accepted Answer · 2021-10-10T17:55:31.497

Here is my solution:

function main() {
  addLinks('февраля', 'example.com');
}

function addLinks(word, url) {
  var doc   = DocumentApp.getActiveDocument();
  var pgfs  = doc.getParagraphs();
  var bound = '[^А-яЁё]'; // any character that is not a Russian letter

  var patterns = [
    {regex: bound + word + bound, start: 1, end: 1}, // word inside of line
    {regex: '^'   + word + bound, start: 0, end: 1}, // word at the start
    {regex: bound + word + '$',   start: 1, end: 0}, // word at the end
    {regex: '^'   + word + '$',   start: 0, end: 0}  // word = line
  ];

  for (var pgf of pgfs) for (var pattern of patterns) {
    var location = pgf.findText(pattern.regex);
    while (location) {
      var start = location.getStartOffset() + pattern.start;
      var end   = location.getEndOffsetInclusive() - pattern.end;
      pgf.editAsText().setLinkUrl(start, end, url);
      location = pgf.findText(pattern.regex, location);
    }
  }
}

It correctly handles a word placed at the start or at the end of a line (or both), and it does not produce the "Invalid regular expression pattern" error.
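
As a possible variation, following the comments that JavaScript regexes in Apps Script do support lookarounds, the offsets can be computed on each paragraph's plain text and then applied with `setLinkUrl`, which leaves the existing formatting alone. This is a sketch under several assumptions: the script runs on the V8 runtime (needed for lookbehind), the offsets of `pgf.getText()` line up with those used by `editAsText()`, and the letter class is extended with Ukrainian letters. `escapeRegExp`, `LETTERS`, `findWordRanges` and `addLinksByOffsets` are hypothetical names, not part of the answer above.

```javascript
// Hypothetical helper: escape regex metacharacters in the search word.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Assumption: letters that count as part of a word (Russian plus Ukrainian).
var LETTERS = 'А-Яа-яЁёІіЇїЄєҐґ';

// Pure function: returns [{start, end}] (end inclusive) for whole-word matches.
function findWordRanges(text, word) {
  if (!word) return [];
  var re = new RegExp(
    '(?<![' + LETTERS + '])' + escapeRegExp(word) + '(?![' + LETTERS + '])', 'g');
  var ranges = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    ranges.push({start: m.index, end: m.index + m[0].length - 1});
  }
  return ranges;
}

// Apps Script part: only runs inside a script bound to a Google Doc.
function addLinksByOffsets(word, url) {
  var pgfs = DocumentApp.getActiveDocument().getParagraphs();
  for (var pgf of pgfs) {
    for (var r of findWordRanges(pgf.getText(), word)) {
      pgf.editAsText().setLinkUrl(r.start, r.end, url);
    }
  }
}
```

Since the lookarounds consume nothing, no start/end trimming is needed and one pattern covers the start, middle and end of a line. Keeping `findWordRanges` free of `DocumentApp` calls also makes it easy to try on sample strings before running it against a document.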

Checked several times with different words in different places. So far it works as it should. I deleted my solution because it really has issues. — Vsevolod, Oct 10 '21 at 17:35
