I also want to compare the results of concurrent parsing with serial parsing. How can I do this? What is the simplest approach?
2 Answers
As others have said, YACC (based on LR parsing) by itself isn't "concurrent". (It is my understanding that YACC isn't even re-entrant, which will make it pretty hard to use in a multithreaded context no matter what you do. The non-reentrancy can presumably be overcome with mere sweat, so this is an annoyance, not a showstopper.)
One idea is to construct a pipeline, allowing a lexer to generate lexemes into a stream as fast as it can, and letting the actual parser read from the stream. This can get you at best a factor of 2. You might be able to do this with YACC relatively easily, modulo setting up communicating threads.
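A minimal sketch of such a pipeline in Java, using a bounded `BlockingQueue` as the token stream; `Token`, `lex()`, and `feedParser()` are hypothetical stand-ins for whatever scanner and parser you actually use:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedParse {
    // Hypothetical token type; EOF acts as the end-of-stream sentinel.
    record Token(String kind, String text) {}
    static final Token EOF = new Token("EOF", "");

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Token> stream = new ArrayBlockingQueue<>(1024);

        // Producer: the lexer emits tokens as fast as it can.
        Thread lexer = new Thread(() -> {
            try {
                for (Token t : lex("input.src"))  // lex() is a stand-in for your scanner
                    stream.put(t);
                stream.put(EOF);                  // signal end of input
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: the parser pulls tokens off the stream.
        Thread parser = new Thread(() -> {
            try {
                Token t;
                while ((t = stream.take()) != EOF)
                    feedParser(t);                // stand-in for the parser's shift step
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        lexer.start();
        parser.start();
        lexer.join();
        parser.join();
    }

    static java.util.List<Token> lex(String file) { /* your scanner here */ return java.util.List.of(); }
    static void feedParser(Token t)               { /* your parser here */ }
}
```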
McKeeman et al. have implemented parallel LR parsing by dividing the file into N chunks for N threads, and claim to have gotten good results. The approach isn't simple, because dividing the parse of a single file into parallel chunks of roughly equal size, and stitching those chunks back together, isn't easy. I doubt that one could easily hack up YACC to do this.
A screwy idea is to parse a file from both ends toward the middle. It's easy enough to define the backward parser's grammar from the "natural" forward one: just reverse the content of every grammar rule. Nothing is that easy, though; this might introduce ambiguities not present in the forward parser. This paper combines McKeeman's idea of breaking the file into chunks with bidirectional parsing of each chunk, enabling one to find a lot of parallelism in a big file.
Easier to do is to parallelize the parsing of individual files, using whatever parsing technology you have. This parallelizes relatively well, although the parsing time for individual files may be uneven, so this is best done with some kind of worklist and a team of parser threads taking work from it. You could probably organize this with YACC.
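For instance, here is a small sketch of that worklist pattern in Java (`Ast` and `parse()` are hypothetical placeholders for your parser); the fixed thread pool's internal queue plays the role of the worklist:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class ParseWorklist {
    // Hypothetical AST type and parse() function standing in for your parser.
    record Ast(String file) {}
    static Ast parse(String file) { return new Ast(file); }

    public static void main(String[] args) throws Exception {
        List<String> files = List.of("A.java", "B.java", "C.java");

        // A fixed team of parser threads pulls files from a shared worklist;
        // uneven per-file parse times get smoothed out automatically.
        ExecutorService team = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        Map<String, Future<Ast>> results = new ConcurrentHashMap<>();
        for (String f : files)
            results.put(f, team.submit(() -> parse(f)));

        for (var e : results.entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue().get());
        team.shutdown();
    }
}
```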
Our DMS Software Reengineering Toolkit (DMS) uses a generalization of this when reading large systems of Java sources and classes. Parsing by itself isn't very useful; you almost always need to build symbol tables, too. DMS reading Java thus parallelizes parsing, AST building, and symbol-table construction.

Starting from a base set of filenames, it launches parses in parallel grains (multiplexed by DMS on top of OS threads); when a parse completes, the resulting tree is handed to the name resolver, which splits into one parallel grain per scope encountered. Nested scopes cause a tree of grains to be generated. Further work is gated by treating the resolution of a scope as a future (event): while one scope is being resolved, more Java parse/name-resolution activities may be launched; when a scope is resolved, an event is signalled, and grains waiting on that scope can then inspect its content to resolve their own names. The tangle of (potential) parallelism in the middle of this is almost frightening :-} but it is managed by DMS's underlying parallel programming language, PARLANSE, which uses work stealing to balance the load across the threads.
Experience shows that doing this in production with 8 cores leads to a 5x speedup over sequential parsing for a few thousand source files (typical Java files are small). We think we know where the bottleneck is: parts of the name resolver are more expensive than they should be, due to excess synchronization in an attribute grammar. I suspect we can get this closer to 8x. So, parallel parsing and name resolution can be pretty effective.
We don't do quite as well with C++14, because individual files depend on the #includes they read, often under different preprocessor configurations.
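As a loose illustration of the "scope resolution as a future" gating described above — only a Java analogy using `CompletableFuture`, not DMS's actual PARLANSE grain machinery — each scope gets a future that resolvers complete and waiting grains subscribe to:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class ScopeFutures {
    // Hypothetical symbol-table scope; the real DMS structures are not shown.
    record Scope(String name) {}

    // One future per scope: resolvers complete it, consumers wait on it.
    static final Map<String, CompletableFuture<Scope>> scopes = new ConcurrentHashMap<>();

    static CompletableFuture<Scope> scopeFuture(String name) {
        return scopes.computeIfAbsent(name, n -> new CompletableFuture<>());
    }

    public static void main(String[] args) {
        // A grain needing names from scope "pkg.Foo" registers a continuation
        // instead of blocking; it runs only once the scope is resolved.
        scopeFuture("pkg.Foo").thenAccept(s ->
                System.out.println("resolving names against " + s.name()));

        // Meanwhile another grain finishes building that scope and signals it.
        scopeFuture("pkg.Foo").complete(new Scope("pkg.Foo"));
    }
}
```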

If the input can be split into chunks that can be parsed independently, e.g. log-file lines, then you can parse a number of such chunks in parallel using a producer/consumer queue and join the parse results afterwards.
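A minimal Java sketch of this, assuming a hypothetical `parseLine()`; an `ExecutorService`'s internal work queue serves as the producer/consumer channel, and the per-chunk futures are joined in the original order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class ChunkedLogParse {
    // Hypothetical record type and parseLine() standing in for your line parser.
    record Entry(String raw) {}
    static Entry parseLine(String line) { return new Entry(line); }

    public static void main(String[] args) throws Exception {
        List<String> lines = List.of("line1", "line2", "line3", "line4");
        int chunkSize = 2;

        // Split the input into independent chunks and parse each chunk as one task...
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<List<Entry>>> futures = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunkSize) {
            List<String> chunk = lines.subList(i, Math.min(i + chunkSize, lines.size()));
            futures.add(pool.submit(() ->
                    chunk.stream().map(ChunkedLogParse::parseLine).collect(Collectors.toList())));
        }

        // ...then join the per-chunk results in order.
        List<Entry> all = new ArrayList<>();
        for (Future<List<Entry>> f : futures)
            all.addAll(f.get());
        pool.shutdown();
        System.out.println(all.size() + " entries parsed");
    }
}
```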