Accessing tokenization of a C++ source file

Question

My understanding is that one step of the compilation of a program (irrespective of the language, I guess) is parsing the source file into some kind of space separated tokens (this tokenization would be made by what's referred to as the scanner in this answer). For instance, I understand that at some point in the compilation process, a line containing `x += fun(nullptr);` is separated into something like

x
+=
fun
(
nullptr
)
;

Is this true? If so, is there a way to have access to this tokenization of a C++ source code?

I'm asking this question mostly out of curiosity, and I do not intend to write a lexer myself.

And the reason I'm curious to know whether one can leverage the compiler is that, to give an example, before meeting [[noreturn]] & Co. I wouldn't have ever considered [[ as a valid token, if I was to write a lexer myself.

Do we necessarily need a true, actual use case? I think we don't, if I am curious about whether there's an existing tool or not to do something.

However, if we really need a use case,

let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of. Clearly, a requirement is that concatenating the elements of the output should make up the whole text again, including line breaks and every other byte of it.

Writing a *lexer* (or scanner) which does lexing (or tokenization) of simple C++ source files is not very hard. In fact I'd say it's very good exercise to do to learn a lot more about file handling, buffering, and string handling. — Some programmer dude, Oct 06 '20 at 06:16
@Someprogrammerdude I'm sure it's not hard, indeed. (I can think of it as a regex-based tool...) But I believe the compiler could give an always up to date answer. In other words, if I had written a lexer before C++11, it would break on rvalue references. If I had been stealing the lexer output from the compiler, it would always be up to date, no? — Enlico, Oct 06 '20 at 06:22
Well you could always look at the documentation for any specific compiler to see if it supports something like that. — Some programmer dude, Oct 06 '20 at 06:31
By the way, why do you need something like that? Is there an underlying problem you need to solve, or is it just plain curiosity? Curiosity is okay, but then please say so in the question. Otherwise please ask about the underlying problem directly, and bring up your thoughts about how to solve it. — Some programmer dude, Oct 06 '20 at 06:32
Please [edit](https://stackoverflow.com/posts/64220138/edit) your question to explain in written English why you are considering parsing C++ code. In most cases, it is not reasonable, and when you really need to do so, it would take a full year of full time work. — Basile Starynkevitch, Oct 06 '20 at 09:03
I've never said I wanted to parse C++ code myself, but that I'd like to access the output of the lexer/scanner. I've updated the question. @BasileStarynkevitch, please, let me know if there's some Italian in it instead of English. By the way, +1 on your answer. — Enlico, Oct 06 '20 at 10:25
The canonical hard problem is `std::vector<std::vector<int>>`. How do you parse that `>>`? The C++ standard is clear on how it should compile, but it does not say how your lexer should work internally. — MSalters, Oct 06 '20 at 10:33
@MSalters, I clarified that I don't want to write my own lexer. — Enlico, Oct 06 '20 at 10:36

score 3 · Accepted Answer · answered Oct 06 '20 at 10:54

With the restriction mentioned in the comment (tokenization keeping `__DATE__`), it seems rather manageable. You need the preprocessing tokens. The Boost.Wave preprocessor necessarily creates a token list, because it has to work on those tokens.

Basile correctly points out that it's hard to assign a meaning to those tokens.

MSalters

Your suggestion seems promising, however I haven't been able to use [this example](https://www.boost.org/doc/libs/1_74_0/libs/wave/doc/quickstart.html). So far I've been able to determine that I need to `#include ` and `#include `, and compile with `g++ -std=c++17 -lboost_thread -lboost_filesystem -lboost_wave source.cpp`. Maybe you can help me? – Enlico Oct 06 '20 at 17:24
@enrico: if you describe what went wrong, you make it a lot easier for someone to help you. – rici Oct 06 '20 at 20:09
@rici, a full fledged question [here](https://stackoverflow.com/questions/64233293/how-can-i-run-wave-quick-start-example). – Enlico Oct 06 '20 at 20:22

Basile Starynkevitch · Answer 2 · 2021-11-21T08:44:57.177

C++ is a very complex programming language.

Be sure to read the C++11 draft standard n3337 before even attempting to parse C++ code.

Look inside the source code of existing open source C++ compilers, such as GCC (at least GCC 10 in October 2020) or Clang (at least Clang 10 in October 2020).

If you have to write your C++ parser from scratch, be sure to have the budget for at least a full person year of work.

Look also into existing C++ static source code analyzers, such as Frama-C++ or Clang static analyzer. Consider adapting one of them to your needs, but do document in writing your needs before starting coding. Be aware of Rice's theorem.

If you want to parse a small subset of C++ (you'll need to document and specify that subset), consider using parser generators like ANTLR or GNU bison.

Most compilers are building some internal representations, in particular some abstract syntax tree. Read the Dragon book for more.

I would suggest instead writing your own GCC plugin.

Indeed, it would be tied to some major version of GCC, but you'll win months of work.

> Is this true? If so, is there a way to have access to this tokenization of a C++ source code?

Yes, by patching some existing open source C++ compiler, or extending it with your plugin (there are licensing conditions related to both approaches).

> let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of.

The above specification is ambiguous.

Do you want the lexemes before or after the C++ preprocessing phase? In other words, what would be the lexeme for e.g. `__DATE__` or `__TIME__`? Read e.g. the documentation of GNU cpp. If you happen to use GCC on Linux (see gcc(1)) and have some C++ translation unit foo.cc, try running `g++ -C -E -Wall foo.cc > foo.ii` and look (using less(1)) into the generated preprocessed form foo.ii. And what about template expansion, preprocessor conditionals, or preprocessor stringizing?

I would suggest writing your GCC plugin working on GENERIC representations. You could also start a PhD work related to your goals.

Notice that generating C++ code is a lot easier than parsing it.

Look inside Qt for an example of software generating C++ code. You could consider using GNU m4, or GNU gawk, or GNU autoconf, or GPP, or your own C++ source generator (perhaps with the help of GNU bison or of ANTLR) to generate some of your C++ code.

PS. On my home page you'll find a hyperlink to some draft report related to your question, and another hyperlink to an open source program generating C++ code. It sadly seems that I am forbidden here to give these hyperlinks, but you could find them in two mouse clicks. You might also look into two European H2020 projects funding that draft report: CHARIOT & DECODER.

For the moderator zealots here: I am *contractually* obliged to mention the H2020 projects funding my work, sorry for the hyperlinks — Basile Starynkevitch, Oct 06 '20 at 08:53
I'd like it before the preprocessing. `__DATE__` should stand as it is. — Enlico, Oct 06 '20 at 10:37
@BasileStarynkevitch: then your contract conflicts with the requirements for this site. You can post them elsewhere, just not here. I'm sure your contract doesn't require you to be posting about your projects here, so if you just don't mention them you should be fine, right? — Martijn Pieters, Oct 06 '20 at 12:33
The real question is which laws apply to me: I am a French citizen, and I don't feel that US laws are relevant to me. In French this is called "impérialisme américain". I leave you to translate that expression — Basile Starynkevitch, Oct 06 '20 at 12:37
U.S. laws don't enter into it. These are rules of the StackExchange network. It is possible that French law obligates the StackExchange network to allow you to promote the H2020 project, but I doubt that this is the case. — Him, Nov 17 '21 at 17:16