7

The problems parsing C++ are well known. It can't be parsed purely based on syntax, it can't be parsed as LALR (or whatever the term is; I'm not a language theorist), the language spec is a zillion pages, etc. For that and other reasons I'm deciding on an alternative language for my personal projects.

Vala looks like a good language. Although it provides many improvements over C++, is it just as troublesome to parse? Or does it have a neat, reasonable-length formal grammar, or some logical description, suitable for building parsers for compilers, source analyzers and other tools?

Whatever the answer, does that go for the Genie alternative syntax?

(I also wonder, albeit less intensely, about D and other post-C++ non-VM languages.)

DarenW
  • 2
    Features are a good reason to choose some specific language for a project, but what does it matter how difficult that language is to parse? (Unless your personal project is writing a compiler for said language.) On that note, unlike Java and C#, C++ does not have an LR(1) grammar, and parsing it can potentially involve infinite lookahead. – wkl Nov 27 '10 at 04:52
  • Vala 'should' be saner than C++. I know Java is REALLY sane; I've used a Java parser written in Java and generated by a compiler-compiler directly from the grammar in EBNF. – cadolphs Nov 27 '10 at 04:59
  • 1
    Err.. it *can* be parsed based on syntax, at least insofar as any language could be considered that way (of course, things like identifiers are technically context sensitive, but that is common across most languages). Yes, the grammar is not LALR(1), but it's of course parsable. On the other hand, difficulty of parsing really shouldn't be your main criterion for choosing a language -- there's a lot to be said for popularity (and therefore ease of acquiring libraries and such) for a given language. – Billy ONeal Nov 27 '10 at 05:15
  • Java is way too "sane" for my taste! – DarenW Nov 27 '10 at 06:25
  • My "personal projects" include tools to analyze source code. I've got various crazy ideas in the back of my mind, for which parsing or otherwise analyzing source code is a part. – DarenW Nov 27 '10 at 06:46
  • 5
    *Parsing is easy*. People that want to build programming tools seem to focus on parsing as *the* problem. It is only *one* of many problems, and in fact it is the easiest since it is well solved. The harder part is acquiring the language semantics and doing something with them. I characterize this as the "climbing Everest" problem; parsing gets you to the foothills and that step is sort of easy. Going from the foothills to the peak requires a whole new class of technology, engineering and sweat (see my bio for what that might look like). – Ira Baxter Nov 27 '10 at 17:13
  • 2
    Parsing C++ is in fact undecidable http://stackoverflow.com/questions/794015/what-do-people-mean-when-they-say-c-has-undecidable-grammar – Josh Lee Nov 28 '10 at 06:16
  • 2
    For those wondering why parse-ability might be important, you might enjoy "The Role of the Study of Programming Languages in the Education of a Programmer" by Daniel Friedman (see https://www.cs.indiana.edu/~dfried/ ). – Weston C Nov 28 '10 at 08:12
  • 2
    @jleedev: You seem to be casting FUD. The mentioned SO article is more nuanced than what is implied in your short comment. As a *practical* matter, *parsing* C++ is decidable or programmers would stop using C++ compilers. The confusion over local ambiguity vs templates is a non-issue. Local ambiguity of certain phrases is easily solved with standard parser technology (I use GLR in our parser and that works fine). Templates computing halting predicates are different but in practice nobody writes those anyway, and the compilers put finite limits on template processing to boot. – Ira Baxter Nov 28 '10 at 14:48
  • @Ira My apologies. I am actually curious about IntelliSense to see what ambiguities it can handle and what it might choke on. – Josh Lee Nov 28 '10 at 17:56
  • 1
    @jleedev: If the issue is "what syntax can be predicted to come next", ambiguity in the grammar doesn't matter because you get the same sequence of surface characters. So a parser which can handle ambiguity for C++ could provide fine IntelliSense-like hints about what can come next. I don't know how IntelliSense actually works, but I hear it has the Edison Design Group parser behind it; that should be good enough to handle the local ambiguity if the left context contains sufficient type information. – Ira Baxter Nov 29 '10 at 08:45
  • 1
    IntelliSense merely needs to know what name can follow a `->` or `.`. This means it has to resolve the name on the left of the `->` or `.` In this specific context, there's no lookahead needed at all, let alone an infinite lookahead. The type of the expression preceding `->` is entirely independent of the token following `->`. – MSalters Nov 29 '10 at 11:47
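The point in the last two comments, that completion after `->` or `.` depends only on the type of whatever sits to the left, can be sketched in a few lines of Python. This is a toy under loud assumptions, not how IntelliSense works: `VARIABLE_TYPES`, `TYPE_MEMBERS` and `complete_member` are hypothetical names, the tables are hard-coded where a real tool would fill them in by parsing and type-checking the left context, and overloaded `operator->` is ignored.

```python
# Hypothetical symbol tables (assumptions for illustration only). A real tool
# would build these by parsing and type-checking the code left of the cursor.
VARIABLE_TYPES = {"widget": "Widget*", "point": "Point"}
TYPE_MEMBERS = {"Widget": ["show", "hide", "resize"], "Point": ["x", "y"]}


def complete_member(left_context):
    """Return the member names that may follow a trailing `->` or `.`.

    Only the text to the left of the cursor is inspected; nothing after the
    operator is needed, so no lookahead is involved.
    """
    for operator in ("->", "."):
        if left_context.endswith(operator):
            name = left_context[: -len(operator)].strip()
            break
    else:
        return []  # the cursor is not right after a member-access operator

    declared = VARIABLE_TYPES.get(name)
    if declared is None:
        return []  # unknown name: the left context lacks the type information

    is_pointer = declared.endswith("*")
    # In this toy model `->` applies only to pointers and `.` only to objects.
    if (operator == "->") != is_pointer:
        return []
    return TYPE_MEMBERS.get(declared.rstrip("*"), [])
```

For example, `complete_member("widget->")` looks up `widget`, finds the hypothetical type `Widget*`, and returns that type's member list without reading anything to the right of the arrow.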

2 Answers

8

C++ is one of the most complex programming languages in common use to parse, if not the most complex. Its name lookup rules and template instantiation rules are particularly difficult. C++ is not parsable using an LALR(1) parser (such as the parsers generated by Bison and Yacc), but it is by all means parsable; after all, people use parsers which have no problem parsing C++ every day. (I originally wrote that early versions of G++ were built on top of Bison's Generalized LR parser framework; that is actually not the case, see comments. The earlier G++ parser was more recently replaced with a hand-written recursive descent parser.)

On the other hand, I'm not sure I see what "improvements" Vala offers over C++. The two languages appear to pursue the same goals. Also, you're probably not going to find much outside of GTK+ written with Vala interfaces. You're going to be using C interfaces to everything else, which really defeats the point of using such a language.

If you don't like C++ due to its complexity, it might be a good idea to consider Objective-C instead, because it is a simple extension of C (like Vala), but has a much larger community of programmers for you to draw upon, given that it is the foundation for everything in Mac land.

Finally, I don't see what the difficulty of parsing the language has to do with what a programmer should care about in order to use the language. Just my 2 cents.

Billy ONeal
  • 1
    +1 for that last part, parsability shouldn't even be a concern for the average developer. – Evan Teran Nov 27 '10 at 06:16
  • I doubt that g++ was based on the Generalized LR parser infrastructure of Bison: that one has been added in version 1.5 in 2002. AFAIK, the pre 3.4 parser was LALR based with more or less clean hacks to make it handle C++. – AProgrammer Nov 27 '10 at 08:16
  • @AProgrammer: I'm afraid I don't understand. C++ simply isn't parsable using LALR. How could one "hack" in support for that? – Billy ONeal Nov 27 '10 at 16:51
  • The early versions of GCC parsed C++ by using Bison operating as a pure *LALR(1)* parser, with some pretty awful hacks involving symbol tables and type information to resolve the ambiguities. (See my answer http://stackoverflow.com/questions/243383/why-c-cannot-be-parsed-with-a-lr1-parser/1004737#1004737 for more details as to how this works). Newer versions use a hand-written recursive descent parser... and I think the same essential hacks to avoid lookahead. – Ira Baxter Nov 27 '10 at 17:02
  • 2
    As far as "awful languages to parse" go, lots of them get my vote: PHP for sheer poor definition; mainframe Natural for having keywords everywhere that can sometimes be identifiers, sometimes not; and if you want to go nuts, try parsing any dynamic HTML in which people mix fragments of JavaScript with HTML chunks and client-side code in some 3rd language (C#, JSP, ...) that manufactures bits of JavaScript. (If you want to *analyze* an HTML page, you need to do stuff like this). – Ira Baxter Nov 27 '10 at 17:08
  • @Billy, I've not looked at the old g++ parser, but I've worked on programs using hacks to make an LALR parser generator handle some language which isn't LALR. Those involve playing with the tokenizer: having it return different things for the same text depending on the context, having it return dummy tokens which come from the parser and not the input, and using it to implement backtracking (refeeding the same input after having changed the context so it is tokenized differently). When you start that, you aren't really using a parser generator, but a strange PL, and anything is possible. – AProgrammer Nov 27 '10 at 18:22
  • Also, Vala is a nice language to learn because it presents C using idioms and syntax familiar to Java and .Net programmers. Coming from most languages, I would think Objective C syntax would be bizarre. – weberc2 Jun 05 '13 at 15:07
  • +1 for the Objective-C reference, people tend to forget it does actually exist outside of the Mac ecosystem. – Ephemera Oct 20 '13 at 05:11
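The tokenizer feedback trick described in the comments above (the same text coming back as a different token depending on the symbol table) can be sketched roughly as follows. This is a minimal Python illustration of the general idea, not the old G++ code: `tokenize`, `classify_statement`, the token kind names and the `type_names` set are all hypothetical, and a real parser would update the set of type names as it processes declarations.

```python
import re

# One token per match: an identifier (group 1) or any other single
# non-space character (group 2), after skipping leading whitespace.
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_]\w*)|(\S))")


def tokenize(source, type_names):
    """Split `source` into (kind, text) pairs, consulting a symbol table.

    `type_names` stands in for the parser's symbol table: the same identifier
    text is reported as TYPE_NAME or IDENTIFIER depending on its contents.
    """
    source = source.strip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        word, punct = match.groups()
        if word is not None:
            kind = "TYPE_NAME" if word in type_names else "IDENTIFIER"
            tokens.append((kind, word))
        else:
            tokens.append(("PUNCT", punct))
        pos = match.end()
    return tokens


def classify_statement(source, type_names):
    """Decide how a statement shaped like `T * x;` should be read."""
    tokens = tokenize(source, type_names)
    if tokens and tokens[0][0] == "TYPE_NAME":
        return "declaration"  # T names a type: x is declared as a pointer to T
    return "expression"       # T names a value: T multiplied by x
```

With `T` in the table, `classify_statement("T * x;", {"T"})` reads the statement as a declaration; with an empty table the identical characters are read as an expression. Once the token kinds differ, an otherwise ordinary grammar can tell the two cases apart, which is the job the symbol-table hacks take over.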
6

It's pretty simple. You can use libvala to do parsing, semantic analysis, and code generation instead of writing your own.

lethalman