I have the regex pattern below and I am getting a catastrophic backtracking issue while testing it.

My regex is:

(\s|\S)*((\%3C)|<)((\%2F)|\/)*[a-zA-Z0-9\%]+((\s|\S)*)+((\%3E)|>)(\s|\S)*

I am testing with this string:

<<<<<<<<<<<<<<<<<<<<<fdslkjdskldsj dsfdlskhfdskhds dskfhdskjfhdsjkfhhaskdfffffshs

Please suggest a solution and tell me what I have done wrong in my regex pattern.

revo
  • Can you please tell us what your regex is supposed to be doing? And then show us sample data we can use to test it? – Tim Biegeleisen May 21 '18 at 11:24
  • Relevant: https://stackoverflow.com/a/1732454/3791827 – ACascarino May 21 '18 at 11:26
  • here is sample string <<<<<<<<<<<<<<<<<<<< – Mayank Sahay May 21 '18 at 11:27
  • What programming language or tool? What are you trying to match? – revo May 21 '18 at 11:51
  • Can you explain what the purpose of this regex/sample is? – CoronA May 21 '18 at 12:20
  • Hi, I am using Java as the programming language and Eclipse as the tool. This expression was written earlier by someone else on my team, so I am not sure what it is matching. I have a screen with a comment section, and this regex is used in that comment section. When I click the save button on the screen, it keeps running and nothing gets saved. – Mayank Sahay May 21 '18 at 12:21
  • @MayankSahay It looks like the person was trying to match HTML tags in either a normal form (e.g. </a>) or a URL-encoded form (e.g. %3C%2Fa%3E) – Zak May 21 '18 at 14:33

1 Answer

The original regex seems to be trying to match simple HTML tags without attributes, in either plain or URL-encoded form, that are embedded in multi-line text, e.g. This is some <b>bolded</b> text.

If so, then something like this:

(?i)[\s\S]*?((?:<|%3C)(?:\/|%2F)?[a-z\d]+(?:>|%3E))[\s\S]*?

would capture the tags within the text (and wouldn't fail when given the input you provided.)

Explained bit by bit, here's how it works:

(?i)        -- makes the regex case insensitive
[\s\S]      -- is an atom to match any character - even newlines
*?          -- match 0 or more, but as few as possible, of the previous atom
(           -- start a capturing group that runs until its matching closing )
(?:         -- start a group that doesn't capture its contents
<|%3C)      -- match either a < or %3C (which is the URL-encoded version of <)
(?:\/|%2F)? -- optionally match a / or %2F
[a-z\d]+    -- match one or more letters or numbers
(?:>|%3E)   -- match a closing > or %3E (which is the URL-encoded version of >)
)           -- close the open group
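
Since you mentioned Java, here is a minimal sketch of how the tag part of that pattern could be used. It is an illustration under a few assumptions, not a drop-in replacement for your code: the class name TagFinder and the method findTags are made up for this example, and it drops the surrounding [\s\S]*? wrappers because Matcher.find() already scans through the input on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class TagFinder {
    // Tag part of the pattern above. The [\s\S]*? wrappers are left out
    // (an assumption of this sketch): find() scans the input by itself.
    private static final Pattern TAG = Pattern.compile(
            "(?:<|%3C)(?:/|%2F)?[a-z\\d]+(?:>|%3E)",
            Pattern.CASE_INSENSITIVE);

    // Returns every plain or URL-encoded tag found in the text, in order.
    static List<String> findTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [<b>, </b>]
        System.out.println(findTags("This is some <b>bolded</b> text."));
        // prints [] (a shortened version of the problem input from the question)
        System.out.println(findTags("<<<<<<<<<<<<<<<<<<<<<fdslkjdskldsj dsfdlskhfdskhds"));
    }
}
```

If you only need to know whether the comment contains a tag, checking whether findTags returns a non-empty list (or calling find() once) should be enough.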
Zak
  • Yes, it works after removing (\s|\S)*. I mean I changed my expression to (\s|\S)*((\%3C)|<)((\%2F)|\/)*[a-zA-Z0-9\%]+((\%3E)|>)(\s|\S)*. My question is: will this expression give the same output as before, and will it also remove the backtracking issue? – Mayank Sahay May 21 '18 at 13:43
  • Hi @MayankSahay - each `(\s|\S)*` is a problem, not just the one that you removed. – Zak May 21 '18 at 14:20