-2

I have the following text:

This MUST should be caught, but not this one **MUST** because it is between **

The idea is that I will be running a search/replace on some files, several times, and I would like the replace to be idempotent. This is because some MUST may have already been changed into **MUST** and I do not want to end up with ******MUST****** after a few runs.

To do that I tried to build a regex that says "match MUST but not if it is surrounded by **":

(?!\(\*\*\))MUST(?!\(\*\*\)) 

(inspired by another question, regex101 playground).

This however matches both MUST.

WoJ
  • Where/how are you going to do the replacement? – VLAZ May 17 '21 at 09:56
  • @VLAZ: I will probably write a Python script (or a Go program), but I am open to any suggestions. This is not something critical, rather a way to make sure that some documentation is more or less consistent. – WoJ May 17 '21 at 09:57
  • Maybe `(?<!\*\*)MUST(?!\*\*)` or `\b(?<!\*\*)MUST\b(?!\*\*)`? It will work in Python but not in Go. – Wiktor Stribiżew May 17 '21 at 09:59
  • `(?<!\*\*)MUST(?!\*\*)` should work – anubhava May 17 '21 at 09:59
  • OK, just making sure you have access to backreferences. You can use [The Best Regex Trick](https://www.rexegg.com/regex-best-trick.html) (scroll to **The Best Regex Trick Ever (at last!)** to skip the preamble if you're not interested) - you can match `\*\*MUST\*\*|(MUST)` and replace only when you have a group match with `**$1**` – VLAZ May 17 '21 at 09:59
  • @anubhava Splitting hairs, but let's just say that `***MUST***` with _three_ stars on each side _should_ be targeted for replacement. The problem with the single regex approach is that it might open up some edge cases (which admittedly may be very unlikely). – Tim Biegeleisen May 17 '21 at 10:09
  • @TimBiegeleisen: May be `(?<!\S)\bMUST\b(?!\S)` is better in that case – anubhava May 17 '21 at 10:12
  • `(?<!\S)\bMUST\b(?!\S)` is the same as `(?<!\S)MUST(?!\S)`, as whitespace boundaries are word boundaries, too. – Wiktor Stribiżew May 17 '21 at 10:18
  • WoJ, please just confirm if you want to find `MUST` in a string like `This **is something I MUST replace**, too.`. Or, if you want to only avoid matching if `**` enclose the word immediately on the left and right. – Wiktor Stribiżew May 17 '21 at 10:27
  • @WiktorStribiżew: Thank you - I did not think about that case. It is not likely to happen, so if this is complicated to handle I will find another way. Ultimately this is Markdown, so `This **is something I MUST replace**, too` should stay as is (as it is already bolded). This is a good catch, thanks. – WoJ May 17 '21 at 11:00
  • You simply should use a markdown parser, or write a parsing code for that. You can't do that safely with regex. – Wiktor Stribiżew May 17 '21 at 11:03
  • @WiktorStribiżew: the MD will be parsed afterward. My problem is that I get some docs from various places, some have highlighted what they were supposed to highlight, some not. I wanted to make a coherent set of docs by adding the highlights where they should be. I also needed to match with regex open HTML tags except XHTML self-contained tags - but that's another question. – WoJ May 17 '21 at 12:11
  • Oh, please, [do not ask it](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) :) – Wiktor Stribiżew May 17 '21 at 12:26
  • @WiktorStribiżew: incredible, there is an answer to everything in SO :) (I got reminded of Tony the Pony a few days ago when reading a Meta discussion about the new usage of native fonts in SE, when one of the strong requirements was to not break this answer :)) – WoJ May 17 '21 at 12:46
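
The lookaround pattern suggested in the comments can be tried out with a minimal Python sketch like the one below. The helper name `bold_must` is hypothetical, and the pattern only checks for `**` immediately next to the word, so it is an illustration of that one suggestion rather than a complete solution.

```python
import re

# Negative lookbehind/lookahead: skip MUST when "**" touches it on either side.
PATTERN = re.compile(r"(?<!\*\*)\bMUST\b(?!\*\*)")


def bold_must(text: str) -> str:
    """Wrap bare MUST in ** (hypothetical helper, not from the thread)."""
    return PATTERN.sub("**MUST**", text)


sample = "This MUST should be caught, but not this one **MUST** because it is between **"
once = bold_must(sample)
twice = bold_must(once)  # a second run finds nothing left to change
print(once)
print(once == twice)
```

As the comments point out, this leaves `***MUST***` untouched and would still bold the word inside a wider span such as `This **is something I MUST replace**, too.`, so it only covers the simple case.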

2 Answers

-1

There is probably a way to do this using pure regex, but if you have access to Python, then you may use re.sub along with a callback function as follows:

import re

inp = "This MUST should be caught, but not this one **MUST** because it is between **"
# Keep **MUST** as it is; replace a standalone MUST with MUST NOT.
output = re.sub(r'\*\*MUST\*\*|\bMUST\b', lambda x: '**MUST**' if x.group() == '**MUST**' else 'MUST NOT', inp)
print(output)

This prints:

This MUST NOT should be caught, but not this one **MUST** because it is between **

The strategy here is that we simply match both **MUST** and MUST (the latter as a standalone word). Then, in the callback function, we no-op for the **MUST** matches, but selectively replace with MUST NOT for the standalone MUST matches.
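
Adapted to the goal stated in the question (bolding rather than inserting MUST NOT), the same callback strategy might look like the sketch below. The function name `highlight` and its `replacement` parameter are assumptions made for illustration.

```python
import re

# Same alternation as above: the already-bold form is tried first.
PATTERN = re.compile(r"\*\*MUST\*\*|\bMUST\b")


def highlight(text: str, replacement: str = "**MUST**") -> str:
    """Leave **MUST** alone and rewrite a standalone MUST (hypothetical helper)."""
    return PATTERN.sub(
        lambda m: m.group() if m.group() == "**MUST**" else replacement,
        text,
    )


inp = "This MUST should be caught, but not this one **MUST** because it is between **"
once = highlight(inp)
print(once)
print(highlight(once) == once)
```

Whether repeated runs are stable depends on the replacement text: with `**MUST**` the second run changes nothing, whereas a replacement such as `MUST NOT` still contains a standalone MUST and would grow on every run.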

Tim Biegeleisen
  • Easier to make your desired match a capture group `|\b(MUST)\b'` then you can check if there was any capture and replace just that. Any of the other alterations are irrelevant, so you can add as many as you want like `\*\*MUST\*\*|""MUST""|--MUST--|` etc. that you want to discard. – VLAZ May 17 '21 at 10:03
  • Meh. The logic I have above is very clear, and using a capture group trick, while perhaps saving some code, makes it more complex. – Tim Biegeleisen May 17 '21 at 10:04
  • Beside what @VLAZ mentions, in cases where a simple string replacement pattern can be used (it looks like it is the case here) the string replacement should be preferred, as using a lambda/method in the replacement has a certain performance penalty. – Wiktor Stribiżew May 17 '21 at 10:08
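
The capture-group variant discussed in these comments could be sketched in Python roughly as follows. The name `bold_captured` is hypothetical, and a function is still used for the replacement here because, in Python, a plain `\1` template would substitute an empty string whenever one of the discarded alternatives matched.

```python
import re

# Alternatives listed before the capture group are matched and then left alone;
# only a match that fills group 1 is rewritten.
PATTERN = re.compile(r'\*\*MUST\*\*|""MUST""|--MUST--|\b(MUST)\b')


def bold_captured(text: str) -> str:
    """Bold only the occurrences captured by group 1 (hypothetical helper)."""

    def repl(m: re.Match) -> str:
        if m.group(1) is None:
            return m.group(0)  # one of the discarded forms: keep it verbatim
        return f"**{m.group(1)}**"

    return PATTERN.sub(repl, text)


sample = 'A MUST, a **MUST**, a ""MUST"" and a --MUST--.'
result = bold_captured(sample)
print(result)
print(bold_captured(result) == result)
```

Adding another form to skip only means adding another alternative in front of the capture group; the replacement logic stays the same.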
-1

In this case, instead of skipping the * characters, you can always replace them, so repeated replacements still produce the same string. A straightforward search and replace does this:

Search: \**\b(MUST)\b\**
Replace: **$1**

This works in pretty much any tool you want, including most text editors. It searches for every instance of MUST as a separate word, possibly surrounded by any number of asterisks, and replaces the whole thing with **MUST**.

Regex 101 demo of the replacement

This does mean that if you have *MUST* or similar, those will be matched too, but if your delimiters are expected to be unique, then replacing them every time is still an idempotent operation.
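
Since the question mentions a Python script, the same search and replace could be sketched there as well. The helper name `normalize_must` is hypothetical; note that Python's replacement template uses `\1` where the editor syntax above uses `$1`.

```python
import re

# Any run of asterisks around a standalone MUST is consumed and rewritten.
PATTERN = re.compile(r"\**\b(MUST)\b\**")


def normalize_must(text: str) -> str:
    """Rewrite MUST, with or without surrounding asterisks, as **MUST**."""
    return PATTERN.sub(r"**\1**", text)


sample = (
    "This MUST should be caught, but not this one **MUST** because it is "
    "between ** but MUSTANG or WORDMUST are not replaced"
)
once = normalize_must(sample)
print(once)
print(normalize_must(once) == once)
```

One caveat under this approach: asterisks that belong to a wider bold span, as in `**MUST be done**`, are treated as if they belonged to the word, so that input becomes `**MUST** be done**`.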

Runnable demo in JavaScript

document.querySelector("#replace")
  .addEventListener("click", e => {
    const container = document.querySelector("#text");
    const text = container.value;
    
    container.value = text.replace(
      /\**\b(MUST)\b\**/g,
      "**$1**"
    );
  });
#text {
  width: 500px
}
<textarea id="text">This MUST should be caught, but not this one **MUST** because it is between ** but MUSTANG or WORDMUST are not replaced</textarea>
<br />
<button id="replace">Replace</button>
VLAZ