-1

Writing a regex that'll validate that some inputs are known link formats which I use on my site, an example would be /section/my-article-1?test=b

The requirements are

  • leading slash
  • the path just contains alfanumerics, dashes and slashes
  • queryparams are allowed

My regex is

/^((\/)[\dA-Za-z-]+)*(\/)?([&?=\dA-Za-z-])*$/;

This kinda works but it's not optimized.

Github CodeScan shows the warning 'Polynomial regular expression' https://codeql.github.com/codeql-query-help/java/java-polynomial-redos/

I assume that's because the groups [\dA-Za-z-] and [&?=\dA-Za-z-] potentially could overlap and cause slowness. But I'm unsure of how to improve it while still allowing queryparams.

How would I optimize the regex?

Here's some testdata I've used

SHOULD MATCH
/
/section
/section/article-1
/section/article-1/
/section/article-1?x=y&hello=world

SHOULD NOT MATCH
section/article-1
/section/!$*
/x(1)

PS: my current regex does allow multiple slashes after eachother, which is undesirable so preventing that would also be a bonus.

Drkawashima
  • 8,837
  • 5
  • 41
  • 52

1 Answers1

0

See the Recommendation:

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Change the pattern to such a regex that does not allow each subsequent part to match at the same location:

^(?:/(?:[0-9A-Za-z-]+/)*[&?=0-9A-Za-z-]*)?$

See the regex demo.

Here,

  • ^ - start of string
  • (?:/(?:[0-9A-Za-z-]+/)*[&?=0-9A-Za-z-]*)? - an optional non-capturing group:
    • / - a slash
    • (?:[0-9A-Za-z-]+/)* - zero or more repetitions of one or more alphanumeric or hyphen chars and then a / char
    • [&?=0-9A-Za-z-]* - zero or more alphanumeric, hyphen, =, ? or & chars
  • $ - end of string.

Just to show why it is efficent:

  • ^ - matches start of string
  • (?:/(?:[0-9A-Za-z-]+/)*[&?=0-9A-Za-z-]*)? - either starts matching with / that is at the start of string, or matches an empty string
    • (?:[0-9A-Za-z-]+/)* - requires alphanumeric/hyphen to appear after a / ([0-9A-Za-z-] does not match /), and ends matching with a / if matched
    • [&?=0-9A-Za-z-]* only matches the chars in the class if preceded with a / char that is missing from the character class (i.e. no overlapping here, either)
  • $ - matches end of string.

Now, if the warning is still there, you can safely ignore it.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563