3

If you ask a question about parsing HTML with regex, you will certainly be referenced to this famous rant. Though there is not a canonical rant for it, I've also been told that regex aren't powerful enough to parse SQL.

I'm a self-taught programmer, so I don't know much about languages from a theoretical perspective. Practically speaking, what are examples of languages or grammars that regex can always parse successfully?

To be specific, I'd really like a few examples of languages that are used in the real world that fit in the category of regular languages, rather than some axioms or equivalent conditions, etc.

Community
  • 1
  • 1
Eric Wilson
  • 57,719
  • 77
  • 200
  • 270
  • Causing bigger problems than the one you originally wanted to solve in the first place. – Brandon Moretz Jul 15 '11 at 21:11
  • possible duplicate of [What are good regular expressions?](http://stackoverflow.com/questions/4954/what-are-good-regular-expressions) – Daniel DiPaolo Jul 15 '11 at 21:14
  • 7
    @Daniel DiPaolo Not at all a duplicate of that question – Eric Wilson Jul 15 '11 at 21:17
  • 1
    @FarmBoy the crux of the possible duplicate is "For what do you use regular expressions?" which is exactly the same question as you are asking - reword your title if you're not looking for what regular expressions are good for and only want to ask about regular languages/grammars – Daniel DiPaolo Jul 15 '11 at 21:19
  • You are right, it was a poor title. – Eric Wilson Jul 15 '11 at 21:21
  • It depends upon your definition of _"regular expression"_. e.g. Perl, PHP/PCRE and .NET all support matching non-"regular" nested structures (nested to any arbitrary depth, limited only by memory). All the modern regex engines like these have gone way, _way_ beyond _REGULAR_ regular expressions! – ridgerunner Jul 15 '11 at 21:39
  • 1
    @ridgerunner, it's not just the languages that support recursive patterns: all languages that support back-references, like: Python, Ruby, Java etc., also fall in the category of regex implementations that parse/match more than regular languages. The regex `(.)\1` is non-regular. – Bart Kiers Jul 15 '11 at 21:43
  • See also http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes for "modern regular expressions", which are quite different from "regular expressions" in the classic sense which can only parse regular languages. – Dietrich Epp Jul 15 '11 at 22:14

4 Answers4

3

Regexes are great for parsing things with only repetitions. They go wrong when you have forms of recursion. I think most useful is showing the simplest language it can't parse:

n open parenthesis followed by n close parenthesis, so for instance: (()) and ((((()))))

If you know you cannot parse that, you can easily conclude that you cannot parse most programming languages.

So I think you could parse basic SQL (though not if you would allow stuff like subqueries). Other prime examples of regex-parseable strings are web adresses, email-adresses, phonenumbers, etc.

If you're looking for actual programming languages which one can parse using regexes you won't find many (though I think (from my limited knowledge of it) parsing assembly should be possible. Most uses however are found in parsing simple strings, or lines.

markijbema
  • 3,985
  • 20
  • 32
0

I have used regex extensively for report processing. PERL backronymed to (Practical Extraction and Report Language) has been used extensively to parse reports from *nix systems. I have used AWK extensively (which is about as close to a regex only language as you can get) for decades to parse out logs, reports, etc.

Regex, like any other computer language/function is a tool in a toolbox. It can parse HTML, it can Parse SQL but to what level and how well was the regex coded. No tool is going to be perfect but if you use the right tool for the right job you'll always have a plethora of them available.

Mech
  • 2,904
  • 3
  • 24
  • 29
0

In short, regexps can't parse structures with unknown level of nesting (like HTML). Because most regexp engines are based on finite state machine. This limits you expression to address only predefined number of states.

You can still parse HTML with regexp, but you can't get things like current path to element in the tree.

Ivan Nevostruev
  • 28,143
  • 8
  • 66
  • 82
-1

They are great for input validations. They are great for parsing well structured data files.

They are not great for parsing a language like html or sql, but they can be used for splitting a language into the relevent tokens.

Regex are often misused and they have a reputation for being difficult to use and understand. Much of this reputation is well earned but not all of it.

Use them for simple cases. Get comfortable with them in the simple cases, and the more complex cases will make more sense. Walk before you run.

Nick Harrison
  • 921
  • 9
  • 17