
I would like to parse a text file using C++. I know the syntax of the file, and from a computer science point of view I don't think I have any problems. However, I don't know exactly how to implement the parser in C++. I think there are a number of possibilities:

  1. flex/yacc: I think that the toolchain is a little outdated, and I don't think that it would work very well with the rest of my program.

  2. plain C: I could read the entire file into one char array and use pointers for random access. The problem is that the text files might be huge and I really wouldn't want to keep them in memory the whole time.

  3. C++ istreams: I think the problem here is that in the process of parsing the file I of course need some kind of lookahead. If an expression doesn't match, then I would have to put the chars that I read so far back into the stream. I think that this would become rather ugly using the unget/putback functions in C++. Also, since the expressions might be rather long, the peek function is probably inadequate for me.

  4. Using boost: Boost supplies regular expressions which would be perfect to recognize tokens, but as far as my research goes, it is not possible to match regular expressions and consume the tokens within the context of an istream.

I also used JavaCC with Java a while back and I have to say that I was very impressed with it. However, I don't think that there is anything like this in C++, is there?

I would really appreciate it if anyone with some experience in the area could point me in the right direction.

Exp
  • It would be helpful if you gave examples like the most challenging thing you're trying to parse. Anyway, more than RegEx, Boost has a parser library called "Spirit": http://boost-spirit.com/ – HostileFork says dont trust SE Nov 01 '11 at 18:52
  • Some more requirements would help steer answers. What kind of text file are you talking about? Comma Separated Values? Some programming language? XML? Can you post a simple example? – Rian Sanderson Nov 01 '11 at 18:53
  • I wouldn't call the Flex/Bison chain "outdated", more like "stable, well-known and battle-tested". – Justin ᚅᚔᚈᚄᚒᚔ Nov 01 '11 at 18:53
  • Well, for #1 you should examine flex and yacc before concluding anything. And for #3 you could build a wrapper around istreams to provide reasonably efficient and easy-to-use lookahead. –  Nov 01 '11 at 18:54
  • Bison, etc. is dead; C? -- why even mention? C++ streams? -- how are they related to the parsing? -- Work on streambuf and copy into vector, you get random access iterators. With C++ you have two modern choices: boost::spirit -- works with many C++ compilers, fast execution, slow compilation; or AXE -- requires C++11 compiler, soon to be released under boost license. – Gene Bushuyev Nov 01 '11 at 19:03
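
The istream wrapper suggested in the comments might look roughly like the sketch below. It is only an illustration, not something taken from the thread: the class name `LookaheadReader` and its `peek`, `match` and `consume` members are hypothetical. The idea is to buffer characters read from the stream in a `std::deque`, so that a failed match leaves them in the buffer and nothing has to be pushed back into the stream.

```cpp
#include <cstddef>
#include <deque>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the thread): buffers characters read from an
// istream so a parser can look ahead arbitrarily far and only consume what
// it actually matched.
class LookaheadReader {
public:
    explicit LookaheadReader(std::istream& in) : in_(in) {}

    // Returns the character n positions ahead (0 = next one) without
    // consuming it, or -1 if the input ends before that position.
    int peek(std::size_t n = 0) {
        fill(n + 1);
        return n < buf_.size() ? static_cast<unsigned char>(buf_[n]) : -1;
    }

    // If the upcoming input starts with `text`, consumes it and returns true;
    // otherwise leaves the buffered input untouched and returns false.
    bool match(const std::string& text) {
        fill(text.size());
        if (buf_.size() < text.size()) return false;
        for (std::size_t i = 0; i < text.size(); ++i)
            if (buf_[i] != text[i]) return false;
        consume(text.size());
        return true;
    }

    // Consumes up to n characters (fewer if the input ends first).
    void consume(std::size_t n) {
        fill(n);
        if (n > buf_.size()) n = buf_.size();
        buf_.erase(buf_.begin(), buf_.begin() + static_cast<std::ptrdiff_t>(n));
    }

private:
    // Reads from the stream until `want` characters are buffered or EOF.
    void fill(std::size_t want) {
        char c;
        while (buf_.size() < want && in_.get(c)) buf_.push_back(c);
    }

    std::istream& in_;
    std::deque<char> buf_;
};
```

A tokenizer would construct this once over a `std::ifstream` and call `match` for fixed tokens or `peek(n)` for longer lookahead. Memory use is bounded by the deepest lookahead rather than by the file size, which seems to be the concern behind option 2.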

2 Answers


If this is true:

plain C: I could read the entire file into one char array and use pointers for random access. The problem is that the text files might be huge and I really wouldn't want to keep them in memory the whole time.

You should look into memory mapped files.

Iczelion has a good tutorial on the Windows API for memory mapped files here.

POSIX provides mmap(). Beej is apparently back online at a new address and provides an example of its use here.

Boost also provides a single way to use the above in a platform-independent way. I don't know much about it because I would rather write something like this myself. I am sure it has its advantages. Boost has a page about it here.

Stack Overflow has a question about parsing a mmap()ed file here.

Joe McGrath

You might also consider ANTLR as a parser generator.

Basile Starynkevitch