
I have written this function, which aims to replace words or phrases in a text document with a specified expression expr, given a set of tokens to be matched. The document is newline-delimited.

function replaceTokens(text, tokens, expr, isline = false) {
  tokens.forEach(word => {
    if (isline) { // line regex: escape some special characters in the phrase
      text = text.replace(new RegExp("(" + word.replace(/([\(\)'?*!"])/g, "\\$1") + ")", "gi"), expr);
    } else { // word regex: matches the token anywhere, even inside longer words
      text = text.replace(new RegExp("(" + word + ")", "gi"), expr);
    }
  });
  return text;
}

I'm facing two problems.

1) For single-word tokens like Lorem, qui, etc. it works reasonably well, but I cannot restrict the match to whole tokens, i.e. I do not want to match qui inside a word like quis, only the standalone token in the text. Using ^word$ does not work here, even with a capture group: ^(word)$

[1 - SOLVED] according to first answer with new RegExp("\\b(" + word + ")\\b", "gi")

2) For phrase tokens, the regex I'm using does not work properly. I want to match the exact line, e.g. Lorem ipsum dolor sit amet, in

Lorem ipsum dolor sit amet
Lorem ipsum dolor sit amet etwas

It should match the first line only, not the second.

Here is an example. For (1) you can see how qui is captured both as a standalone token and inside the words quis and aliquip.

function replaceTokens(text, tokens, expr, isline = false) {
  tokens.forEach(word => {
    if (isline) { // line regex
      text = text.replace(new RegExp("(" + word.replace(/([\(\)'?*!"])/g, "\\$1") + ")", "gi"), expr);
    } else {
      text = text.replace(new RegExp("\\b(" + word + ")\\b", "gi"), expr);
    }
  });
  return text;
}

text = "Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,\nsed do eiusmod tempor incididunt\nut labore et dolore magna aliqua.\nUt enim ad minim veniam,\nquis nostrud exercitation ullamco laboris nisi\nut aliquip ex ea commodo consequat.\nDuis aute irure dolor in reprehenderit in voluptate velit esse\ncillum dolore eu fugiat nulla pariatur.\nExcepteur sint occaecat cupidatat non proident,\nLorem ipsum dolor sit amet etwas,\nsunt in culpa qui officia deserunt mollit anim id est laborum"

out = replaceTokens(text, ["Lorem", "ut", "qui"], "<strong>$1</strong>", false)
out_phrases = replaceTokens(text, ["Lorem ipsum dolor sit amet", "Duis aute irure dolor in reprehenderit"], "<strong>$1</strong>", true)
document.getElementById("in_text").innerHTML = text.replace(/\n/g, '<br/>')
document.getElementById("out_text").innerHTML = out.replace(/\n/g, '<br/>')
document.getElementById("out_phrases").innerHTML = out_phrases.replace(/\n/g, '<br/>')
<div id="in_text"></div>
<hr>
<div id="out_text"></div>
<hr>
<div id="out_phrases"></div>

Added a JSFiddle snippet to try it out.

loretoparisi
  • Is the problem in the second case that some part of the phrase may go onto the next line, preventing a match? The code snippet seems to match the second case with no problem. – jrook Oct 16 '18 at 17:32
  • Your second problem isn't clear. What are you trying to do and what is the expected output? – revo Oct 16 '18 at 17:32
  • Right, sorry. The second question is about matching a `whole line` passed in the tokens array against lines in the text, where the latter could differ in some cases: given `Lorem ipsum dolor sit amet etwas.\nLorem ipsum dolor sit amet`, I only want to match the last one with the token `Lorem ipsum dolor sit amet`. – loretoparisi Oct 16 '18 at 17:34
  • For second enable `m` flag: `/^PATTERN$/m` – revo Oct 16 '18 at 17:38
  • @revo since I'm using a capture group like `/(phrase)/` how it will be in that case? – loretoparisi Oct 16 '18 at 17:39
  • `/^(PATTERN)$/m` – revo Oct 16 '18 at 17:41
  • You could also insert separators into the regex: `Lorem[\n\s]ipsum[\n\s]hello` – jrook Oct 16 '18 at 17:42
  • `[\n\s]` is equal to `\s` – revo Oct 16 '18 at 17:42
  • using `RegExp` constructor, the place you should define flags is in the second argument where you have `gi` right now. So it should be `gim` and you should remove slashes. – revo Oct 16 '18 at 17:44
  • @revo with `text = text.replace(new RegExp("^(" + word.replace(/([\(\)'?*!"])/g, "\\$1") + ")$/m", "gi"), expr);` it will not work. – loretoparisi Oct 16 '18 at 17:44
  • Try `new RegExp("^(" + word.replace(/([()'?*!"])/g, "\\$1") + ")$", "gim")` – revo Oct 16 '18 at 17:46
  • Using the `gim` flags and `/^(PATTERN)$/` I'm still getting `Lorem ipsum dolor sit amet` matched within `Lorem ipsum dolor sit amet etwas,` – loretoparisi Oct 16 '18 at 17:48
  • Please provide a fiddle in your question like the one you made for the first problem so we can see. – revo Oct 16 '18 at 17:49
  • https://regex101.com/r/mGkf8r/1/ based on my first idea with @revo 's comment applied – jrook Oct 16 '18 at 17:51
  • @revo I have added a JSFiddle https://jsfiddle.net/2sftun0L/ – loretoparisi Oct 16 '18 at 17:55
  • You have a comma before `\n`; that's why the first sentence doesn't match, because that comma isn't included in the regex. I removed that comma, see what happens now: https://jsfiddle.net/2sftun0L/1/ – revo Oct 16 '18 at 18:13
  • How about this one? https://regex101.com/r/mGkf8r/3. It matches the phrase only if it is in a single line. – jrook Oct 16 '18 at 18:17
  • @jrook it will match contained phrases like `Lorem ipsum dolor sit amet` within `Lorem ipsum dolor sit amet etwas`. I only want exact phrase match, that should permit capture group as well like in the single word case: `\b(^word$)\n`, but this pattern does not work for phrases. – loretoparisi Oct 17 '18 at 07:26
  • @revo right, the problem is that punctuation is allowed in the text, so there could be a comma or other punctuation before the `\n` line terminator, like `Lorem ipsum dolor sit amet,` or something else, so the regex should take this into account in some way. – loretoparisi Oct 17 '18 at 07:29
  • To consider trailing punctuation marks see this https://jsfiddle.net/2sftun0L/3/ – revo Oct 17 '18 at 07:32

1 Answer


The first question is straightforward: wrap your regex string in '\b' (word boundary), written as \\b inside a JavaScript string literal:

      text = text.replace(new RegExp("\\b(" + word + ")\\b", "gi"), expr);

That should match 'Whole Words only'.
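As a minimal sketch of the word-boundary approach, the snippet below also escapes regex metacharacters in each token before building the pattern. The helper names escapeRegExp and replaceWholeWords and the sample string are assumptions made for illustration; they are not part of the original question. The last line shows why an unescaped "\b" fails when the pattern is built from a string: in a string literal it is the backspace character, not a word boundary.

```javascript
// Hypothetical helper: escape regex metacharacters so a token is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: apply `expr` to every whole-word occurrence of each token.
// In `expr`, $1 refers to the matched word.
function replaceWholeWords(text, tokens, expr) {
  return tokens.reduce(
    (acc, word) =>
      acc.replace(new RegExp("\\b(" + escapeRegExp(word) + ")\\b", "gi"), expr),
    text
  );
}

const sample = "quis aliquip qui";
console.log(replaceWholeWords(sample, ["qui"], "<strong>$1</strong>"));
// → quis aliquip <strong>qui</strong>

// "\b" in a string literal is a backspace (U+0008), hence the need for "\\b".
console.log("\b" === "\u0008"); // → true
```

Note that \b is defined in terms of word characters, so a token that starts or ends with a non-word character (for example a trailing +) will not be bounded the way this sketch assumes.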

For the second question, you can check that the phrase is either at the start of the text or follows a dot or a comma, and that it is followed by either a dot, a comma or the end of the text, like this:

text = text.replace(new RegExp("(^|\\.\\s?|,\\s?)(" + word.replace(/([\(\)'?*!"])/g, "\\$1") + ")($|\\.|,)", "gi"), expr);

The idea is that it should match a SENTENCE, not a line. A sentence either starts at the start of the string or after a dot or a comma, and it ends with a dot, a comma or the end of the string.

You should NOT use the 'Multiline' option.

Edit2:

I have changed the groups I added to non-capturing groups, so they don't mess up the group numbering in the replacement. Now it's:

text = text.replace(new RegExp("(?:^|\\.\\s?)(" + word.replace(/([\(\)'?*!"])/g, "\\$1") + ")(?:\\.|,|$)", "gi"), expr);

Now it works in the fiddle.
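One thing to be aware of: a non-capturing group is still part of the match, so the leading dot and whitespace and the trailing punctuation are replaced along with the phrase. The following is a rough sketch of one way to keep them, assuming it is acceptable to capture the leading delimiter and re-emit it, and to test the trailing delimiter with a lookahead so it is never consumed. The names escapeRegExp and highlightSentences, the hard-coded replacement "$1<strong>$2</strong>" and the sample text are assumptions for illustration only.

```javascript
// Hypothetical helper: escape regex metacharacters so a phrase is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap a phrase only when it forms a whole "sentence".
// Group 1 keeps the leading delimiter (start of string, or a dot/comma plus whitespace),
// group 2 is the phrase, and the lookahead checks the trailing delimiter without consuming it.
function highlightSentences(text, phrases) {
  return phrases.reduce((acc, phrase) => {
    const pattern = new RegExp(
      "(^|[.,]\\s*)(" + escapeRegExp(phrase) + ")(?=[.,]|$)",
      "gi" // no "m" flag, so ^ and $ refer to the whole string
    );
    return acc.replace(pattern, "$1<strong>$2</strong>");
  }, text);
}

const sample =
  "Lorem ipsum dolor sit amet,\nUt enim ad minim veniam,\nLorem ipsum dolor sit amet etwas,";
console.log(
  highlightSentences(sample, ["Lorem ipsum dolor sit amet", "Ut enim ad minim veniam"])
);
// → <strong>Lorem ipsum dolor sit amet</strong>,
//   <strong>Ut enim ad minim veniam</strong>,
//   Lorem ipsum dolor sit amet etwas,
```

Because the newline sits in group 1 and the comma is only looked at, both survive the replacement, and the line ending in etwas is left untouched.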

Poul Bak
  • This makes sense, but if I do `text = text.replace(new RegExp("\b(" + word + ")\b", "gi")` it does not work if you try the snippet modified...why? I get both `qui` and `aliquip` – loretoparisi Oct 16 '18 at 17:23
  • It needs to be escaped: `\\b` – cuzi Oct 16 '18 at 17:26
  • Yes, it works when escaping it as `\\b`. The second point is about matching a whole line passed in the tokens array against lines in the text, where the token could be contained in a longer line in some cases. I want only the exact match. – loretoparisi Oct 16 '18 at 17:32
  • You could just link to one of millions of questions asking for word boundaries instead of duplicating content i.e [Regex match entire words only](https://stackoverflow.com/questions/1751301/regex-match-entire-words-only) – revo Oct 16 '18 at 17:35
  • Well, I wanted to answer both questions. – Poul Bak Oct 16 '18 at 17:45
  • @PoulBak I have updated the question with JSFiddle, I'm not sure the second solution works. The first one it is ok, thanks. – loretoparisi Oct 16 '18 at 18:01
  • hey @PoulBak first thank you! I have found out that I had to add a trailing `\n` in the replacement, because for some reason your solution two regex does not take it in account in the capture group, see this test: `out_phrases = replaceTokens(text, ["Ut enim ad minim veniam"], "\n$1", true)` where the phrase is in the middle. It seems that `\s` after a `\n` it changes the capture like in `\nUt enim ad minim veniam` and `.\n Ut enim ad minim veniam` and the same for punct like `\nUt enim ad minim veniam` and `\nUt enim ad minim veniam,` - note the last `,`. – loretoparisi Oct 26 '18 at 12:07