
I am using the following regex in VS Code for a search and replace. It is meant to match an outer tag plus 3 nested tags.

<tag>(((.|\n)*?)(</tag>)){4}

If I add any character to the end of this regex, VS Code crashes. In my case, I was going to specify a tag after that match.

I'm pretty new to regex, so I'm trying to keep it simple.

I know this is a common problem that has something to do with backtracking, and I want to know how to simplify this.

lloyd noone
  • Nested HTML and regex usually don't belong in the same sentence together. Can you show us an example of what you are trying to match using regex? – Tim Biegeleisen Jul 20 '20 at 16:30
  • I wish I could, but it's confidential. It's literally just a div with some more tags, including divs, inside it. I was figuring out this regex because it's something I would need to do on a regular basis, so I can't get too specific anyway. – lloyd noone Jul 20 '20 at 16:54
  • I don't think that tags with lorem ipsum text are confidential, and any combination of HTML tags is not confidential either. – rioV8 Jul 20 '20 at 17:25
  • Use `Emmet: Balance (outward)` – rioV8 Jul 20 '20 at 17:26
  • Try `(?:[^<\r]*(?:<(?!/tag>)[^<]*)*){4}` – Wiktor Stribiżew Jul 20 '20 at 17:56
  • @WiktorStribiżew thanks! That seems to let me carry on typing without crashing VS Code. Presumably because it is more specific to HTML rather than `*`? Now I just need to figure out why that works, and experiment with adding a tag after that match in case it matches more than once. – lloyd noone Jul 20 '20 at 18:07
  • @rioV8 I've just looked up that Emmet shortcut. That is basically what I'm trying to accomplish using regex, and I will also use that on a daily basis! – lloyd noone Jul 20 '20 at 18:17

1 Answer


NEVER use (.|\n)*?. It is an unfortunate, widely known pattern that causes so much backtracking that it often leads to situations like this one: when the text is long and specific enough, the result is catastrophic backtracking.

Note that even [\w\W]*? (or [\s\S\r]*?, see Multi-line regular expressions in Visual Studio Code) might already suffice here. Although it also involves quite a lot of backtracking, it is much more efficient.
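As a rough illustration of the point above, the following sketch checks that the alternation form and the character-class form capture the same text on a small multi-line input. It uses Python's `re` module as a stand-in for the VS Code regex engine, which is an assumption, and the sample string is made up for the example.

```python
import re

# Hypothetical sample: an outer <tag> whose body spans two lines.
sample = "<tag>line one\nline two</tag> trailing text"

# The form from the question: an alternation inside a lazily repeated group.
alternation = re.compile(r"<tag>((.|\n)*?)</tag>")

# The character-class form: a single class that matches any character,
# newlines included, so there is no alternation to retry at each position.
char_class = re.compile(r"<tag>([\w\W]*?)</tag>")

m1 = alternation.search(sample)
m2 = char_class.search(sample)

# Both forms capture the same body; they differ in how much work the
# engine does to get there, not in what they match.
same_body = m1.group(1) == m2.group(1)
print(same_body)
print(repr(m2.group(1)))
```

On an input this small both versions finish instantly; the difference only becomes visible on long texts where the overall match fails or needs many retries.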

What can usually be used is an unrolled pattern, like

<tag>(?:[^<\r]*(?:<(?!/tag>)[^<]*)*</tag>){4}

Instead of (.|\n)*?, a series of subpatterns is used, each of which can only match at distinct positions in the string, so the engine has far fewer alternative paths to retry.

Details

  • <tag> - a literal string
  • (?:[^<\r]*(?:<(?!/tag>)[^<]*)*</tag>){4} - four repetitions of
    • [^<\r]* - 0 or more chars other than <, including line break chars. In VS Code regex, the \r is what ensures this: its presence makes every character class in the pattern that can match newlines actually match them, so \r does not need to be repeated in the next character class.
    • (?:<(?!/tag>)[^<]*)* - 0 or more repetitions of a < not followed with /tag> and then 0 or more chars other than <.
    • </tag> - a literal </tag> string.
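The breakdown above can be tried out with a minimal sketch. It uses Python's `re` module rather than VS Code itself, which is an assumption: in Python a negated class such as `[^<]` matches newlines regardless, and `\r` is simply excluded, so the sample uses `\n` line breaks only. The sample markup and the trailing `<p>` tag are hypothetical.

```python
import re

# Hypothetical sample: one outer <tag> containing three nested <tag> elements,
# followed by another element.
sample = (
    "<tag>\n"
    "  <tag>one</tag>\n"
    "  <tag>two</tag>\n"
    "  <tag>three</tag>\n"
    "</tag>\n"
    "<p>after</p>"
)

# The unrolled pattern from the answer, copied verbatim.
unrolled = r"<tag>(?:[^<\r]*(?:<(?!/tag>)[^<]*)*</tag>){4}"

match = re.search(unrolled, sample)

# The match should run from the outer opening tag through the fourth </tag>.
closing_count = match.group(0).count("</tag>")
print(closing_count)

# Appending more pattern text after the unrolled part, which is the step that
# caused trouble in the question, still matches here without any issue.
followed = re.search(unrolled + r"\s*<p>", sample)
print(followed is not None)
```

Each repetition consumes text up to and including exactly one `</tag>`, so four repetitions cover the three nested closing tags plus the outer one.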

Having said that, you might also be interested in the Emmet: Balance (outward) command:

A well-known tag balancing: searches for tag or tag's content bounds from current caret position and selects it. It will expand (outward balancing) or shrink (inward balancing) selection when called multiple times. Not every editor supports both inward and outward balancing due to some implementation issues; most editors have outward balancing only.

Emmet’s tag balancing is quite unique. Unlike other implementations, this one will search tag bounds from caret’s position, not the start of the document. It means you can use tag balancer even in non-HTML documents.

Wiktor Stribiżew