9

Possible Duplicate:
Is there a regular expression to detect a valid regular expression?
Regular expression for finding a regular expression?

I have an application that enables the user to input a regular expression. How do I check against any input of regular expressions and make sure they are valid ones, because if they are not there there will be preg_match errors?

I don't want to use the '@' before preg_match, so if there's a way to check the validity of the user input of regular expressions that'd be great.

The regular expression system of PHP seems to be rather too complicated for me to come up with a regular expression for them.

Community
  • 1
  • 1
datasn.io
  • 12,564
  • 28
  • 113
  • 154
  • 3
    Depending upon the purposes of the app you might want to somehow set a timeout on this. Users can very easily intentionally or accidentally enter a regex that catastrophically back tracks and uses lots of server resources in the process. – Martin Smith May 07 '10 at 14:48

5 Answers5

30

Mathematically, it is impossible to validate a regular expression using a regular expression. This is so because (formal) regular expressions can only recognize regular languages. A language is any set of strings. For example, the set of all decimal numbers is a language (which by the way can be described using a regular expression); the set of all valid regular expressions is also a language. Regular languages are languages that require only fixed finite memory (not a function of the input size) to be recognized.

The language that contains all valid regular expressions is not a regular language; hence it is impossible to recognize a regular expression using a regular expression.

To understand this, notice that regular expressions contain parentheses in them that must match. Hence, if an "(" has occurred, a ")" must occur later on. This is impossible to describe with a machine that has only fixed finite memory. For, if there were a way to do it, and your regular expression had a finite memory of K different states (for some integer K), an expression with K opening parentheses followed by K closing parentheses, though a valid regular expression would have been unable to be recognized by that machine -- a contradiction (notice that in formal languages, our assumption is that text processing occurs one character at a time, from left to right, which is the same for applied regular expressions). We call languages such as the one that describes regular expressions context-free and not regular.

(It is trivial to prove that regular expressions do not form a regular language using the Pumping Lemma)

So, there is a fundamental computer science problem in recognizing regular expressions using regular expressions: It is mathematically impossible to do so.

Regular languages are possible to be recognized by finite-state automata, i.e. machines with finite states but without memory. To overcome your problem, you need to add some memory which is dependant on the input size. Regular expressions, as they are context-free (fortunately they're not some obscure, hard-to-recognize type of language) can be recognized in linear time using a push-down automaton. This is a "for" loop that goes through the expression one token (usually a character) at a time and keeps track of what it's seen on a stack, i.e. it "pushes" data that it laters "pops" in a first-in-last-out fashion. (Example of data pushed to the stack: "I need to remember to find a matching `)' later on!"; you can "push" this as many times as you need; you can "pop" it later, when you need to check if you actually needed to have matched an opening parenthesis previously).

Of course, writing your own recognition engine for regular expressions would be a bit of an overhead -- but if you want to do it, you should know the above limitations. It would be more wise to employ an already existing mechanism to do it -- I suspect you could give that job to a regular expression library or a language that is more keen on handling regular expressions such as Perl; but the @-method doesn't sound like too bad of an idea after all: It may be slow, but your users may input terribly slow regular expressions anyway; and it may be a bad practice, but in your case it seems the best possible solution available.

Some related articles in Wikipedia:

I hope this helped!

nickie
  • 5,608
  • 2
  • 23
  • 37
dionyziz
  • 2,394
  • 1
  • 23
  • 29
  • 3
    +1: Long answer, but this is the only CORRECT one. Regular expression syntax in PCRE is a context-free language. Also: Consider linking your article references to Wikipedia itself :-) (Sorry if I'm missing out on a ninja edit where you're already doing this) – Platinum Azure May 07 '10 at 18:31
  • Thanks for the suggestion! I added a small paragraph in the "Regular expressions" article under the "Implementations and running times" section with a reference to this post :) – dionyziz May 07 '10 at 18:52
  • 1
    The languages described by regular expressions are not necessarily regular (at least if we are talking about real world regular expressions, not the mathematical construct); specifically, backreferences in PCRE can be used to describe irregular languages like a^n b^m a^n. Some PCRE flavors allow recursion, conditional subpatterns, atomic groups etc. Also, PCRE is not a context-free language because named subgroups cannot reuse names. – Tgr May 07 '10 at 21:11
  • Tgr: Which is exactly why I said "formal regular expressions". To study PCRE formally, one would have to construct a formal language for it -- too much trouble for nothing if you ask me. In any case, PCRE is at BEST context-free (if we choose to use only some common aspects of it), so it certainly is not parsable using a finite-state automaton with fixed size memory, and the result still holds: Regular expressions cannot be validated using regular expressions. – dionyziz May 07 '10 at 23:50
10

preg_match() returns FALSE if an error occurred.

  1. send the expression to the server
  2. preg_match on an empty string
  3. see if an error occurs

You can either use Ajax to validate real time, or validate after form submit.
You can also try to validate by feeding the expression to javascript regexp engine, but js regexp syntax is not 100% compatible with the php one.

clyfe
  • 23,695
  • 8
  • 85
  • 109
  • 3
    Make sure you use a type-safe comparison: `preg_match(…) === false`. – Gumbo May 07 '10 at 15:05
  • 1
    Just clarifying: this correctly answers the question "Is there **a way** to detect a valid regular expression?". If you want a regular expression for that, it is impossible as explained by @dionyziz. – nickie Oct 19 '13 at 10:17
6

Letting users submit regular expressions is almost certainly a Bad Idea.

Some expressions are very expensive. Try this:

preg_match('/(.*){1,32000}[bc]/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

and that's just 30 characters of input! They don't all look like that either: /^(?:(\d+)|::)*$/ is also exponential-time in PCRE.

geocar
  • 9,085
  • 1
  • 29
  • 37
2

The first way that comes to mind is to use preg_last_error() after calling preg_match($sanatized_user_regex, ""); If you get anything other than PREG_NO_ERROR then respond with the appropriate error message.

Sean Vieira
  • 155,703
  • 32
  • 311
  • 293
2

Your question is a little ambiguous. Are you looking to validate the 'syntax' of the regular expression, or make sure the regular expression actually parses out content once applied to a string. I think in either case, you should leave the validation to the user (eg. provide a debug/textbox they can enter a string to match their regex against it. If there is something wrong with the regex or if no match is found, show a 'Not found' error).

In terms of validating the regex itself, you probably want to start with a simple validator that checks that only valid characters (eg. part of regex syntax like $,^ \t etc) are part of their regex, but I think trying to validate logical constructs within the regex might be rather complicated. Maybe there are some libraries out there that validate regex syntax, but I'm not aware of any.

idleworx
  • 107
  • 1
  • 1