Parsing custom format file in C#

Question

I need to parse a custom file format with C#. The file is the PBX project file of an Xcode project. There is no official documentation for the format, but it's rather straightforward. Here is a simple example:

// !$*UTF8*$!
{
    archiveVersion = 1;
    classes = {
    };
    objectVersion = 46;
    objects = {

        /* Begin PBXBuildFile section */
        5143B90C1884374800F27FD8 /* Foundation.framework in Frameworks */ = {isa = PBXBuildFile; fileRef = 5143B90B1884374800F27FD8 /* Foundation.framework */; };
        5143B90E1884374800F27FD8 /* CoreGraphics.framework in Frameworks */ = {isa = PBXBuildFile; fileRef = 5143B90D1884374800F27FD8 /* CoreGraphics.framework */; };
        5143B9101884374800F27FD8 /* UIKit.framework in Frameworks */ = {isa = PBXBuildFile; fileRef = 5143B90F1884374800F27FD8 /* UIKit.framework */; };
        /* End PBXBuildFile section */
    };
    rootObject = 5143B9001884374800F27FD8 /* Project object */;
}

The objects section contains a sequence of object definitions: each is a unique object id followed by its properties. As you can see, the file can contain comments. Property values can also be enclosed in quotes.

The complete example of PBX file is here.

Now I need to build a DOM of the file. What is the best approach to this kind of task?

This would be a good start: http://stackoverflow.com/questions/7557273/tutorial-or-guide-for-scripting-xcode-build-phases Of course, it depends on your requirements. What is the tool going to be used for? Do you expect the format to be very fixed (i.e. can you always assume that the line `objects = {` will not be written with `{` on the next line, etc.)? If it's just some internal tool, you could probably get away with reading line by line and parsing in a simple, naive way (like `if (line.IndexOf("objects") != -1) ...`). — Luaan, Jan 13 '14 at 16:26
Define best. You could do a grammar and parse it. Or you could loop through line by line and chop strings, and test for particular strings... — Tony Hopkinson, Jan 13 '14 at 16:26
It shouldn't rely on specific positions of spaces and line breaks. It's a grammar, and I'm looking for a guide to parsing it. — alexey, Jan 13 '14 at 16:53
It would also be cool if the parser could be used the way XmlReader works in .NET: reading token by token. — alexey, Jan 13 '14 at 16:56
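
One way to get the XmlReader-like, token-by-token reading asked for in the comment above is a small lexer exposed as a C# iterator. The following is only a sketch under assumptions: the names `PbxTokenKind`, `PbxToken` and `PbxLexer` are hypothetical, the token set covers just what the sample in the question shows (no handling of other constructs a real PBX file may contain), and escapes inside quoted strings are handled in the simplest possible way.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical token kinds, limited to what the question's sample contains.
public enum PbxTokenKind { LBrace, RBrace, Assign, Semicolon, Word, QuotedString, Comment }

public sealed class PbxToken
{
    public PbxToken(PbxTokenKind kind, string text) { Kind = kind; Text = text; }
    public PbxTokenKind Kind { get; private set; }
    public string Text { get; private set; }
}

public static class PbxLexer
{
    // Yields tokens lazily, so the caller pulls them one at a time.
    public static IEnumerable<PbxToken> Tokenize(string input)
    {
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }

            if (c == '{') { i++; yield return new PbxToken(PbxTokenKind.LBrace, "{"); }
            else if (c == '}') { i++; yield return new PbxToken(PbxTokenKind.RBrace, "}"); }
            else if (c == '=') { i++; yield return new PbxToken(PbxTokenKind.Assign, "="); }
            else if (c == ';') { i++; yield return new PbxToken(PbxTokenKind.Semicolon, ";"); }
            else if (c == '/' && i + 1 < input.Length && input[i + 1] == '/')
            {
                // Line comment: runs to the end of the line.
                int end = input.IndexOf('\n', i);
                if (end < 0) end = input.Length;
                yield return new PbxToken(PbxTokenKind.Comment, input.Substring(i + 2, end - i - 2).Trim());
                i = end;
            }
            else if (c == '/' && i + 1 < input.Length && input[i + 1] == '*')
            {
                // Block comment: runs to the next "*/".
                int end = input.IndexOf("*/", i + 2, StringComparison.Ordinal);
                if (end < 0) throw new FormatException("Unterminated comment at offset " + i);
                yield return new PbxToken(PbxTokenKind.Comment, input.Substring(i + 2, end - i - 2).Trim());
                i = end + 2;
            }
            else if (c == '"')
            {
                var sb = new StringBuilder();
                i++; // opening quote
                while (i < input.Length && input[i] != '"')
                {
                    // Assumption: a backslash just makes the next character literal.
                    if (input[i] == '\\' && i + 1 < input.Length) i++;
                    sb.Append(input[i++]);
                }
                if (i >= input.Length) throw new FormatException("Unterminated string");
                i++; // closing quote
                yield return new PbxToken(PbxTokenKind.QuotedString, sb.ToString());
            }
            else
            {
                // Bare word: everything up to whitespace or punctuation.
                int start = i;
                while (i < input.Length && !char.IsWhiteSpace(input[i]) && "{}=;\"".IndexOf(input[i]) < 0) i++;
                yield return new PbxToken(PbxTokenKind.Word, input.Substring(start, i - start));
            }
        }
    }
}
```

A caller can then consume it with `foreach (var token in PbxLexer.Tokenize(text)) { ... }` and stop early at any point, since nothing past the current token has been scanned yet.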

greenoldman · Accepted Answer · 2014-01-28T17:58:05.840

Use a parser (regex is a no-go because of the nested braces). Pick the one whose syntax you feel comfortable with:

ANTLR, LLLPG, Coco/R,
GOLD, Irony, NLT (my own),
or the already mentioned Sprache.

I guess you are new to this, which is why I grouped them by approach, one per line: top-down, bottom-up, and combinator. My personal preference is bottom-up, because defining mathematical expressions feels more natural to me that way, but you should not run into that kind of problem here.

As of 2014-01-28, NLT includes a simple reader for PBXProj files.

score 0 · Answer 2 · answered Jan 13 '14 at 21:08

I've found that the Sprache project is really good for this type of grammar.

For simple parsing cases, regexes can be enough too.
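
As an illustration of such a simple case: the single-line PBXBuildFile entries in the question have a fixed, non-nested shape, so one pattern can extract the ids. This is a sketch only. The names `BuildFileEntry` and `BuildFileRegex` are made up for illustration, and the pattern assumes each entry looks exactly like the sample (24-character hex ids, `isa` first, `fileRef` second). It will not cope with nested braces or reordered properties.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical holder for one matched entry.
public sealed class BuildFileEntry
{
    public string Id;
    public string Name;
    public string FileRef;
}

public static class BuildFileRegex
{
    // Shape assumed: id /* name */ = {isa = PBXBuildFile; fileRef = id ...
    private static readonly Regex Entry = new Regex(
        @"^\s*(?<id>[0-9A-F]{24})\s*/\*\s*(?<name>.*?)\s*\*/\s*=\s*\{\s*isa\s*=\s*PBXBuildFile;\s*fileRef\s*=\s*(?<fileRef>[0-9A-F]{24})\b",
        RegexOptions.Multiline);

    public static List<BuildFileEntry> Find(string text)
    {
        var result = new List<BuildFileEntry>();
        foreach (Match m in Entry.Matches(text))
        {
            result.Add(new BuildFileEntry
            {
                Id = m.Groups["id"].Value,
                Name = m.Groups["name"].Value,
                FileRef = m.Groups["fileRef"].Value
            });
        }
        return result;
    }
}
```

Calling `BuildFileRegex.Find(fileText)` on the sample from the question should return one entry per PBXBuildFile line. Anything with real nesting still needs a proper parser.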

answered Jan 13 '14 at 21:08

alexey

score 0 · Answer 3 · answered Jan 13 '14 at 22:58

I use the Regex classes when they are suitable, but for more structured data like you've shown here I would turn to ANTLR as documented here for C#.

answered Jan 13 '14 at 22:58

Sam Harwell

score 0 · Answer 4 · answered Jan 14 '14 at 00:13

If you need to be able to match nested braces, regexes will not work. You could use a parser generator like ANTLR, but this format looks simple enough to write your own recursive descent parser.

Before we could show you how to write the parser we would need to know what kind of DOM you want to output.
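
In the meantime, here is a minimal sketch of what a hand-written recursive descent parser could look like, assuming the simplest possible DOM: a `Dictionary<string, object>` whose values are either strings or nested dictionaries. The class name `PbxParser` is hypothetical, comments are skipped rather than kept, and only the constructs visible in the question's sample are handled (braces, `key = value;` pairs, quoted and bare values). It also assumes a bare value is separated from a following comment by whitespace, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class PbxParser
{
    private readonly string _s;
    private int _pos;

    private PbxParser(string s) { _s = s; }

    public static Dictionary<string, object> Parse(string text)
    {
        var p = new PbxParser(text);
        var root = p.ParseDict();
        p.SkipTrivia();
        if (p._pos != text.Length) throw p.Error("trailing input");
        return root;
    }

    // dict := '{' (atom '=' value ';')* '}'   value := dict | atom
    private Dictionary<string, object> ParseDict()
    {
        Expect('{');
        var dict = new Dictionary<string, object>();
        while (Peek() != '}')
        {
            string key = ParseAtom();
            Expect('=');
            object value = Peek() == '{' ? (object)ParseDict() : ParseAtom();
            Expect(';');
            dict[key] = value;
        }
        Expect('}');
        return dict;
    }

    // atom := quoted string | run of characters up to whitespace or punctuation
    private string ParseAtom()
    {
        SkipTrivia();
        var sb = new StringBuilder();
        if (_pos < _s.Length && _s[_pos] == '"')
        {
            _pos++; // opening quote
            while (_pos < _s.Length && _s[_pos] != '"')
            {
                // Assumption: a backslash just makes the next character literal.
                if (_s[_pos] == '\\' && _pos + 1 < _s.Length) _pos++;
                sb.Append(_s[_pos++]);
            }
            if (_pos >= _s.Length) throw Error("unterminated string");
            _pos++; // closing quote
            return sb.ToString();
        }
        while (_pos < _s.Length && !char.IsWhiteSpace(_s[_pos]) && "{}=;\"".IndexOf(_s[_pos]) < 0)
            sb.Append(_s[_pos++]);
        if (sb.Length == 0) throw Error("expected a key or value");
        return sb.ToString();
    }

    // Returns the next significant character, or '\0' at end of input.
    private char Peek() { SkipTrivia(); return _pos < _s.Length ? _s[_pos] : '\0'; }

    private void Expect(char c)
    {
        if (Peek() != c) throw Error("expected '" + c + "'");
        _pos++;
    }

    // Skips whitespace, /* ... */ comments and // line comments.
    private void SkipTrivia()
    {
        while (_pos < _s.Length)
        {
            if (char.IsWhiteSpace(_s[_pos])) { _pos++; }
            else if (string.CompareOrdinal(_s, _pos, "/*", 0, 2) == 0)
            {
                int end = _s.IndexOf("*/", _pos + 2, StringComparison.Ordinal);
                if (end < 0) throw Error("unterminated comment");
                _pos = end + 2;
            }
            else if (string.CompareOrdinal(_s, _pos, "//", 0, 2) == 0)
            {
                int end = _s.IndexOf('\n', _pos);
                _pos = end < 0 ? _s.Length : end + 1;
            }
            else break;
        }
    }

    private FormatException Error(string message)
    {
        return new FormatException(message + " at offset " + _pos);
    }
}
```

With this shape, `PbxParser.Parse(text)["objects"]` would be a nested dictionary keyed by object id. If the DOM needs to preserve comments or property order, the dictionary would have to be replaced with a richer node type.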

answered Jan 14 '14 at 00:13

Dour High Arch

21,513
29
75
90

Why write a recursive descent parser by hand, when ANTLR provides similar (or better) performance with a much higher degree of maintainability? – Sam Harwell Jan 14 '14 at 05:12
