Catastrophic backtracking issue solution

Question

I have the regex pattern below, and I am getting a catastrophic backtracking issue while testing it.

My regex is:

(\s|\S)*((\%3C)|<)((\%2F)|\/)*[a-zA-Z0-9\%]+((\s|\S)*)+((\%3E)|>)(\s|\S)*

Testing with the string:

<<<<<<<<<<<<<<<<<<<<<fdslkjdskldsj dsfdlskhfdskhds dskfhdskjfhdsjkfhhaskdfffffshs

Please suggest a solution and tell me what I have done wrong in my regex pattern.

Can you please tell us what your regex is supposed to be doing? And then show us sample data we can use to test it? — Tim Biegeleisen, May 21 '18 at 11:24
What programming language or tool? What are you trying to match? — revo, May 21 '18 at 11:51
Hi, I am using Java as the programming language and Eclipse as the tool. This expression was written earlier by someone else on my team, so I am not sure what it is meant to match. I have a screen with a comment section, and this regex is used on that comment section. When I click the Save button on that screen, it keeps processing and the comment never gets saved. — Mayank Sahay, May 21 '18 at 12:21
@MayankSahay It looks like the person was trying to match HTML tags in either a normal form (e.g. </a>) or a URL-encoded form (e.g. %3C%2Fa%3E). — Zak, May 21 '18 at 14:33

Zak · Answer 1 · 2018-05-21T14:56:16.780

The original regex seems to be trying to match simple HTML tags without attributes, in either plain-text or URL-encoded form, embedded in multi-line text, e.g. This is some <b>bolded</b> text.

If so, then something like this:

(?i)[\s\S]*?((?:<|%3C)(?:\/|%2F)?[a-z\d]+(?:>|%3E))[\s\S]*?

would capture the tags within the text (and wouldn't hang when given the input you provided).

Explained bit by bit, here's how it works:

(?i)        -- makes the regex case insensitive
[\s\S]      -- is an atom to match any character - even newlines
*?          -- match 0 or more, but as few as possible, of the previous atom
(           -- start a group that captures everything matched up to its closing )
(?:         -- start a group that doesn't capture its contents
<|%3C)      -- match either a < or %3C (which is the URL-encoded version of <)
(?:\/|%2F)? -- optionally match a / or %2F
[a-z\d]+    -- match one or more letters or numbers
(?:>|%3E)   -- match the closing > or %3E (which is the URL-encoded version of >)
)           -- close the capturing group
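
Since you mentioned Java, here is a minimal sketch of how the regex above could be used with java.util.regex. The class name TagFinder and the method findTags are made up for illustration and are not from your code base. It assumes you want to pull out every tag with Matcher.find() rather than test the whole string at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class TagFinder {
    // The regex from this answer, with backslashes doubled for a Java string literal.
    private static final Pattern TAG = Pattern.compile(
        "(?i)[\\s\\S]*?((?:<|%3C)(?:\\/|%2F)?[a-z\\d]+(?:>|%3E))[\\s\\S]*?");

    // Collects the contents of capture group 1 (the tag itself) for every match.
    static List<String> findTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        // Sample text from the answer: two simple tags are found.
        System.out.println(findTags("This is some <b>bolded</b> text."));
        // The input from the question: no tag is found, and the call returns promptly.
        System.out.println(findTags(
            "<<<<<<<<<<<<<<<<<<<<<fdslkjdskldsj dsfdlskhfdskhds dskfhdskjfhdsjkfhhaskdfffffshs"));
    }
}
```

If you only need to know whether a tag is present at all, checking whether the returned list is empty (or calling find() once) would be enough.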

Yes, it is working after removing (\s|\S)*. I mean I changed my expression to (\s|\S)*((\%3C)|<)((\%2F)|\/)*[a-zA-Z0-9\%]+((\%3E)|>)(\s|\S)*. My concern is: will this expression give the same output as before, and will it also remove the backtracking issue? — Mayank Sahay, May 21 '18 at 13:43
Hi @MayankSahay - each `(\s|\S)*` is a problem, not just the one that you removed. — Zak, May 21 '18 at 14:20
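
One way to act on that, sketched below under some assumptions: if the save code only needs a yes/no answer (for example because it currently calls something like String.matches), the leading and trailing (\s|\S)* can be dropped entirely and Matcher.find() used instead, because find() already searches for the pattern anywhere in the input. The class name CommentValidator and the method containsTag are hypothetical. The tag part is taken from the shortened expression in the comment above, with the capturing groups made non-capturing.

```java
import java.util.regex.Pattern;

// Hypothetical validator, named for illustration only.
class CommentValidator {
    // Tag part of the shortened regex, without the (\s|\S)* wrappers.
    private static final Pattern TAG =
        Pattern.compile("(?:%3C|<)(?:%2F|/)*[a-zA-Z0-9%]+(?:%3E|>)");

    // Returns true if the comment contains a plain or URL-encoded tag anywhere.
    static boolean containsTag(String comment) {
        return TAG.matcher(comment).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTag("This is some <b>bolded</b> text."));
        System.out.println(containsTag(
            "<<<<<<<<<<<<<<<<<<<<<fdslkjdskldsj dsfdlskhfdskhds dskfhdskjfhdsjkfhhaskdfffffshs"));
    }
}
```

With find() the match can start and end anywhere, so the result should be the same yes/no answer the wrapped expression was meant to give, without asking the engine to try every way of splitting the surrounding text.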
