Can the shunting yard algorithm parse POSIX regular expressions?

Question

At first glance, the shunting yard algorithm seems applicable to POSIX regular expression parsing, but since I don't have much experience (or theoretical background) in writing parsers, I'd like to ask SO before jumping in and writing something only to get stuck halfway.

Perhaps a more sophisticated version of the question is: What is a good formal statement of the class of problems the shunting yard algorithm can be applied to?

Clarification: This question is about whether you can parse POSIX re syntax into an abstract syntax tree using the basic principles of the shunting algorithm, not whether you can use regular expressions to implement the shunting algorithm. Sorry I wasn't clear enough stating that to begin with!

When you talk about parsing a regex, do you mean tokenizing the string describing the regular language? Or do you mean executing the finite-state automaton that it represents? Or something else? — Gabe, Nov 12 '10 at 04:48
I mean building an AST representing the regular expression. Converting that AST into an automaton to match the regex is another problem, I know. — R.. GitHub STOP HELPING ICE, Nov 12 '10 at 04:51

score 2 · Answer 1 · edited Jun 22 '15 at 05:58

2

I'm fairly sure it can. If you look at Henry Spencer's regular expression package:

regexp.shar.Z

which was the basis for Perl's regular expressions, you will notice that he describes the program as being in "railroad normal form".

edited Jun 22 '15 at 05:58

Mateen Ulhaq

24,552
19
101
135

answered Nov 12 '10 at 09:40

score 1 · Answer 2 · answered Nov 12 '10 at 09:54

1

I reckon you'd have some problems because different characters have different meanings in different contexts e.g.

^[^a-z][asd-]

The ^ has two different meanings and so does the -. I think I'd choose a recursive descent parser.

answered Nov 12 '10 at 09:54

JeremyP

84,577
15
123
161

That's what I thought at first, but if it's possible to efficiently determine the context from the stack/queue state, treating these special should not be a big deal, right? – R.. GitHub STOP HELPING ICE Nov 12 '10 at 10:08
By the way, `[^a-z]` and `[asd-]` are both individual tokens. There's no reason to treat them as anything more complex at the parser level. They get their chance to be special when it comes time to build FA. – R.. GitHub STOP HELPING ICE Nov 12 '10 at 10:25
@R.. I'm not saying it's impossible or even really hard, but as soon as you find yourself having to walk the stack, you are probably going the wrong way. I just think it would be cleaner as a recursive descent parser. – JeremyP Nov 12 '10 at 10:28
@R.. : If you are tokenising at that level, then my issue probably goes away and the shunting algorithm becomes reasonably simple to implement. – JeremyP Nov 12 '10 at 10:43
What I like about shunting yard is that the very predictable performance and memory usage. It's all in a single buffer (`4*strlen(regex)` should suffice) and failure cases are purely initial allocation failure or invalid syntax. There's no possibility of runaway stack usage or having to backtrack and free a partially-built AST on failure. – R.. GitHub STOP HELPING ICE Nov 12 '10 at 21:42

score 0 · Answer 3 · answered Nov 12 '10 at 09:31

I don't see why it wouldn't be suitable. Looking at some old code, it does seem I used a completely different parsing strategy for my last regexp parser, however (essentially, a walk-through from the start, building the resulting automaton representation as you go, with some look-ahead and recursive calls to implement grouping of regular expressions).

score -1 · Answer 4 · edited May 23 '17 at 10:27

I will say that the answer to your question is "no, you cannot implement the shunting yard algorithm using a regular expression." This is for the same reason you cannot parse arbitrary HTML using regular expressions. Which boils down to this:

Regular expressions do not have a stack. Because the shunting yard algorithm relies on a stack (to push and pop operands as you convert from infix to RPN), then regular expressions do not have the computational "power" to perform this task.

This glosses over many details, but a "regular expression" is one way to define a regular language. When you "use" a regular expression, you are asking the computer to say: "Look at a body of text and tell me whether or not any of those strings are in my language. The language that I defined using a regular expression." I'll point to this most excellent answer which you and everyone reading this should upvote for more on regular languages.

So now you need some mathematical concept to augment "regular languages" in order to create more powerful languages. If you were to characterize the shunting yard algorithm as an realization of a model of computational power, then you might say that the algorithm would be described as a context-free grammar (hey what do you know, that link uses an expression parse tree as an example.) A push-down automata. Something with a stack.

If you are less-than-familiar with automata theory and complexity classes, then those wikipedia articles are probably not that helpful without explaining them from the ground up.

The point being, you may be able to use regex to help writing shunting yard. But regex are not very good at doing operations that have an arbitrary depth, which this problem has. So I would not spend too much time going down the regex avenue for this problem.

I think you misread my question. I'm not asking to implement shunting yard with regular expressions. I'm asking if it's possible to parse a regular expression (into an AST) using a variant of the shunting algorithm. — R.. GitHub STOP HELPING ICE, Nov 12 '10 at 04:44
Regardless, you've written a nice answer, albeit to a different question, so I hope you don't just delete it! — R.. GitHub STOP HELPING ICE, Nov 12 '10 at 04:45
@poundifdef, here is the tribute question to your nice answer. hint-hint: http://stackoverflow.com/questions/9298076/can-you-use-regular-expressions-to-implement-the-shunting-algorithm — Prof. Falken, Feb 15 '12 at 17:21

Can the shunting yard algorithm parse POSIX regular expressions?

4 Answers4

Linked