When should I use a parser?

Question

I have had problems in Regexes to divide a code up into functional components. They can break or it can take a long time for them to finish. The experience raises a question:

"When should I use a parser?"

Not exactly sure if it's a duplicate -- but check the following posts out: * [When Is A Problem Too Complex For A Regular Expression?](http://stackoverflow.com/questions/230517/when-is-a-problem-too-complex-for-a-regular-expression) * [Alternatives To Regular Expressions](http://stackoverflow.com/questions/514313/alternatives-to-regular-expressions) — dirkgently, Apr 11 '09 at 12:35

lc. · Accepted Answer · 2009-04-11T12:48:00.527

You should use a parser when you are interested in the lexical or semantic meaning of text, when patterns can vary. Parsers are generally overkill when you are simply looking to match or replace patterns of characters, regardless of their functional meaning.

In your case, you seem to be interested in the meaning behind the text ("functional components" of code), so a parser would be the better choice. Parsers can, however, internally make use of regex, so they should not be regarded as mutually exclusive.

A "parser" does not automatically mean it has to be complicated, however. For example, if you are interested in C code blocks, you could simply parse nested groups of { and }. This parser would only be interested in two tokens ('{' and '}') and the blocks of text between them.

However, a simple regex comparison is not sufficient here because of the nested semantics. Take the following code:

void Foo(bool Bar)
{
    if(Bar)
    {
        f();
    }
    else
    {
        g();
    }
}

A parser will understand the overall scope of Foo, as well as each inner scope contained within Foo (the if and else blocks). As it encounters each '{' token, it "understands" their meaning. A simple search, however does not understand the meaning behind the text and may interpret the following to be a block, which we of course know is not correct:

{
    if(Bar)
    {
        f();
    }

dfa · Answer 2 · 2009-04-11T14:42:24.230

3

you need a parser when:

language is not regular (wikipedia)
you need a parse tree (more generally when you need to execute actions contextually)
when the resulting regular expression is too obscure/complex

My 2 cents.

edited Apr 11 '09 at 14:42

answered Apr 11 '09 at 14:10

dfa

114,442
31
189
228

I don't mean to nitpick, but for point 1, do we need a lexer or a parser ? – Parag Jul 24 '12 at 12:20

score 2 · Answer 3 · answered Apr 11 '09 at 12:39

There are a few compelling use cases for parsers over regular expressions. You should use a parser instead of a regular expression:

Whenever the kinds of expressions you'd like to work with are more complex than few semantic entities (tags, variables, phone numbers, etc.).
Whenever you need to know the semantic meaning of text instead of merely matching a pattern. For example, if you're trying to match all possible ways of writing a phone number, a parser is probably better than a regex. If you're trying to match a specific pattern that happens to correspond to a phone number, a regex is probably fine.
Whenever input can't be guaranteed to be well-formed.
If you're working entirely within the structure of a well-defined language that has a syntax specification (C#, XML, C++, Ruby, etc.), there's already going to be a parser, so you have some work done for you.

@John Feminella, I could be wrong, but I am not sure I agree with the phone number example. If we want to match various ways of writing a phone number, I think it can still be very well represented as a regex (with an optional list of patterns). This may not be a very good example of a case when semantics are needed. — Parag, Jul 24 '12 at 12:18
@Parag: I wish I still had the blissful inner peace that comes from believing phone numbers can be matched with regular expressions. Phone numbers are dreadfully complicated to validate fully. — John Feminella, Jul 24 '12 at 14:56

Andrei Vajna II · Answer 4 · 2009-04-13T09:50:53.133

2

The Dragon Book has a small section about what you can't use Regular Expressions for:

They can't detect repetition of a string, meaning you can't match constructs like 'wcw', where w is the same succesion of symbols
You can only detect a fixed number of repetition or an unspecified number of repetitions, which is to say you can't use an already parsed token to determine the number of repetitions, something like: 'n s1 s2 ... sn'
"Regular Expressions can't be used to describe balanced or nested constructs, [like] the set of strings of all balanced parentheses"

For 1 and 2, there's a simple explanation, you can't capture a substring so you can match it later. If you would, than you would be using a parser. Just think of how you would be using regular expressions for those cases, and you will intuitively come to the conclusion you can't. :)

For 3, it's the same as the problem in K&R for parsing string literals. You can't just say a string literal is between the first ' " ' and the second ' " ', but what happens when there's an escaped quote(\")?

As for the relation to Russel's paradox, I think you're hunch is right, because the problem is regex's limited introspection capabilities. The book has references to the proofs. If you want to, I can look them up for you.

edited Apr 13 '09 at 09:50

answered Apr 12 '09 at 23:09

Andrei Vajna II

4,642
5
35
38

What are the premises for each argument? 1. no inference about itself 2. since memory is limited, tokens must be finite 3. all -- I don't know why but when I read the writing I started to think about Russell's paradox. Can you reduce their proofs to it? – Léo Léopold Hertz 준영 Apr 13 '09 at 01:03
@Asdrei Vajna II Please, have a try at "%s@\\(h\\(el\\)lo\\)@the string is \1 and the substring is \2 @", when you have only a line with a word "hello". – Léo Léopold Hertz 준영 Apr 13 '09 at 11:32
When I thought about the paradox, I think the definitions for "string" and "substring" are different, what I presumed in my comment. When I matched a substring, I did not match the whole string "\\(h\\(el\\)lo\\)". When I match a whole string "/\\(h\\(el\\)lo/\\)", I would not match a substring "el". – Léo Léopold Hertz 준영 Apr 13 '09 at 11:38
And when you try to prove each case, you need to reduce the problem to similar situations. When I think about a parser, I spontaneously start to think the Russel thing. It hard for me to think what each writer is meaning by words like string or substring. – Léo Léopold Hertz 준영 Apr 13 '09 at 11:42
@Asdrei Vajno II: Can you post the page numbers? I ordered the Dragon book a some time ago. Hopefully, it precisely defines terms. – Léo Léopold Hertz 준영 Apr 13 '09 at 11:44
It's page 98, Chapter 3.3 - "Specification of Tokens", the last section of the chapter, called "Nonregular Sets". – Andrei Vajna II Apr 13 '09 at 14:59

score 1 · Answer 5 · answered Apr 11 '09 at 12:36

You need to use a parser as soon as you have a problem regular expressions is not meant to, (or simply can't) solve. Matching (un)balanced parenthesis (recursively) for instance is one of those problems. Eventhough some flavours, like PCRE, get you very far they don't win over a hand written parser.

score 1 · Answer 6 · answered Apr 11 '09 at 13:09

1

Here are some use cases, courtesy of Steve Yegge: Rich Programmer Food.

answered Apr 11 '09 at 13:09

Yuval F

20,565
5
44
69

+1 for the personal blog post. I bought three books about compilers, recurrence and such related things after reading =) – Léo Léopold Hertz 준영 Apr 11 '09 at 20:52
Thanks. In that case, take a look at: http://stackoverflow.com/questions/725372/which-programming-languages-text – Yuval F Apr 12 '09 at 03:50

score 0 · Answer 7 · answered Apr 11 '09 at 12:36

Your question is a bit vague, but I guess my opinion is that when your regex becomes complicated or takes too long, and you have a reasonably defined "language" to deal with, a parser will be easier.

I don't think you can set a line in the sand and say that anything on one side can be done by regex, and on the other side you need a parser. It depends on the situation.

score 0 · Answer 8 · answered Apr 12 '09 at 23:19

There are things that regex cannot do while parser can do.
For example:

Start ::= (Inner);
Inner ::= Start | x;

Regular expression wouldn't be able to do that because regex can't track if there are same number of open and close parenthesis. That is why when you are trying to tokenize and parse a large file, parser is expected to be used, while regex can simply find special pattern(s) inside the file.

When should I use a parser?

8 Answers8

Linked