Regex substitution: Replace texts, not codes

Question

I've been trying to solve a regex quiz for days. I'm getting close, but I still can't get it to pass.

Task:

In an HTML page, replace the text micro with &micro;. Oh, and don't screw up the code: don't replace inside <the tags> or &entities;

Replace

micro -> &micro;
abc micro -> abc &micro;
micromicro -> &micro;&micro;
&micro;micro -> &micro;&micro;

Don't touch

<tag micro /> -> <tag micro />
&micro; -> &micro;
&abcmicro123; -> &abcmicro123;

I tried the following, but it fails on the last &micro;. Can someone point out what I missed? Thanks in advance!

What I have tried:

Regex

((?:\G|\n)(?:.*?&.*?micro.*?;[\s\S]*?|.*?<.*?micro.*?>[\s\S]*?|.)*?)micro

Substitution

$1&micro;

This is really difficult using regexp. If you don't want to match in certain contexts you have to use negative lookbehind, but they're required to be a fixed size, so you can't make it not match anywhere after ` — Barmar, Oct 22 '20 at 06:46
It is a [quiz on regex101](https://regex101.com/quiz/21). I can tell it is really difficult to solve, but maybe I'm on the wrong track in the first place. I just need a hint in the right direction. — Hao Wu, Oct 22 '20 at 06:51
Good luck on that. Think of HTML comments, of script tags, of CDATA, of attributes having `>` in their value, etc, etc, ... As stated, regex is not the right tool for parsing HTML. — trincot, Oct 22 '20 at 07:00

score 1 · Accepted Answer · Michail · 2020-10-22T17:13:10.290

You can try something like this:

(?:<.*?>|&\w++;)(*SKIP)(*F)|micro

replacement string:

&micro;
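
For engines without backtracking control verbs (JavaScript's built-in RegExp, for example, does not support (*SKIP)(*F)), roughly the same effect can be had by matching tags and entities as ordinary alternatives and handing them back unchanged from a replacement callback. A minimal sketch, assuming that approach; replaceMicro and quizCases are hypothetical names, not part of the answer:

```javascript
// Emulates (?:<.*?>|&\w++;)(*SKIP)(*F)|micro without control verbs.
// The first two alternatives consume whole tags and entities; the callback
// returns those unchanged and rewrites only a bare "micro".
// (\w+ stands in for the possessive \w++, which JavaScript lacks.)
const PROTECTED_OR_MICRO = /<.*?>|&\w+;|micro/g;

function replaceMicro(html) {
  return html.replace(PROTECTED_OR_MICRO, (match) =>
    match === "micro" ? "&micro;" : match
  );
}

// The seven cases from the question.
const quizCases = [
  "micro",
  "abc micro",
  "micromicro",
  "&micro;micro",
  "<tag micro />",
  "&micro;",
  "&abcmicro123;",
];

for (const input of quizCases) {
  console.log(`${input} -> ${replaceMicro(input)}`);
}
```

The design choice is the same as SKIP-FAIL: let the regex consume the spans you want to protect first, so the target alternative never gets a chance to start inside them.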

edited Oct 22 '20 at 17:13

answered Oct 22 '20 at 17:07


score 1 · Answer 2 · answered Oct 22 '20 at 20:45

Use the SKIP-FAIL technique, but match micro as a whole word:

(?:<[^<>]*>|&\w+;)(*SKIP)(*F)|\bmicro\b


Explanation

--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    <                        '<'
--------------------------------------------------------------------------------
    [^<>]*                   any character except: '<', '>' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    >                        '>'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    &                        '&'
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    ;                        ';'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (*SKIP)(*F)              discard the text matched so far and resume searching from the current position
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  micro                    'micro'
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
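
One thing to check with the \b variant: as far as I can tell, it no longer rewrites micromicro, because there is no word boundary between the two copies, so it would not pass that case from the question. A rough JavaScript sketch of the difference (JavaScript has no (*SKIP)(*F), so the skip is emulated here with a replacement callback; replaceWholeWordMicro is a hypothetical name):

```javascript
// Same idea as the pattern above, with the skip emulated: tags and entities
// are matched first and returned unchanged; only a whole-word "micro" is rewritten.
function replaceWholeWordMicro(html) {
  return html.replace(/<[^<>]*>|&\w+;|\bmicro\b/g, (match) =>
    match === "micro" ? "&micro;" : match
  );
}

console.log(replaceWholeWordMicro("abc micro"));     // abc &micro;
console.log(replaceWholeWordMicro("&micro;micro"));  // &micro;&micro;
console.log(replaceWholeWordMicro("micromicro"));    // micromicro (unchanged)
console.log(replaceWholeWordMicro("microscope"));    // microscope (unchanged)
```

Whether that is a bug or a feature depends on the goal: the boundaries protect words like microscope, at the cost of the micromicro case.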

Didn't even know these flags exist. This makes things so much easier. Thanks for putting me on the right track and the detailed explanation! But I'm going to accept the other answer since he posted the answer earlier. — Hao Wu, Oct 23 '20 at 00:39
@HaoWu Yes, always accept what works for you best. And thank you for a nice question, +1. — Ryszard Czech, Oct 23 '20 at 19:57

score 0 · Answer 3 · answered Oct 22 '20 at 23:13

var strings = [
    "micro",
    "abc micro",
    "micromicro",
    "&micro;micro",
    "<tag micro />",
    "&micro;",
    "&abcmicro123;"
];
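// Negative lookbehind: reject "micro" when an unclosed "<" or "&" precedes it.
// Variable-length lookbehind needs engine support (e.g. recent V8); PCRE does not accept this pattern.
var re = /(?<!(<[^>]*|&[^;]*))(micro)/g;
strings.forEach(function(str) {
    // Group 1 sits inside the lookbehind, so the word itself is group 2.
    var result = str.replace(re, '&$2;');
    console.log(str + ' -> ' + result);
});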

Console log output:

micro -> &micro;
abc micro -> abc &micro;
micromicro -> &micro;&micro;
&micro;micro -> &micro;&micro;
<tag micro /> -> <tag micro />
&micro; -> &micro;
&abcmicro123; -> &abcmicro123;

Explanation:

(?<!...) - negative lookbehind that excludes micro inside tags or entities
(<[^>]*|&[^;]*) - inside the negative lookbehind: an unclosed '<...' OR an unclosed '&...' immediately before the match
(micro) - captures the word to replace (add alternatives as needed, such as (micro|brewery))
'&$2;' - the replacement turns the captured word into an entity &...;
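
A caveat worth testing before relying on this: the lookbehind blocks any micro that follows an unclosed & or <, even in plain text where no tag or entity is actually present. A small sketch of that behaviour, assuming an engine with variable-length lookbehind (recent V8/Node.js); replaceWithLookbehind is a hypothetical wrapper around the same regex:

```javascript
// Same pattern as in the answer, wrapped in a helper for testing.
const lookbehindRe = /(?<!(<[^>]*|&[^;]*))(micro)/g;

function replaceWithLookbehind(text) {
  return text.replace(lookbehindRe, "&$2;");
}

// Ordinary text: replaced as expected.
console.log(replaceWithLookbehind("abc micro"));           // abc &micro;

// A stray "&" or "<" earlier in the text suppresses the replacement,
// because "& " and "< b, " both satisfy the lookbehind alternatives.
console.log(replaceWithLookbehind("salt & micro greens")); // salt & micro greens
console.log(replaceWithLookbehind("a < b, micro"));        // a < b, micro
```

For the seven quiz strings this does not matter, but on real pages with unescaped & or < in text it may leave some occurrences untouched.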

Unfortunately, PCRE doesn't support variable-width lookbehinds :( Otherwise it would be so much easier. — Hao Wu, Oct 23 '20 at 00:30