3

Let's say one, for parsing purposes, wants to skip the contents of the template identifier list:

template<(valid code)>
        ^            ^
        | from       | to

The first thing that comes to mind would be to blindly find the first >, but that doesn't always work:

template<(valid code),template<(valid code)> >
        ^                                  ^
        | from                             | to, oops

A better approach is to skip pairs of < and > recursively:

template<(valid code),template<(valid code)> >
        ^                                    ^
        | from                               | to, better

However, even this approach fails for arcane but valid masterpieces like this one (from bits\random.h, line 69, GCC 4.7.x):

template<(...),bool = __w < static_cast<size_t>(...)>
        ^                 ^            ^      ^     ^     ^
        | 1               | 2          | 3    | 2   | 1   | where did 0 go?

My question is then, what is a correct way to find the end of any valid template identifier list?

abyss.7
  • 13,882
  • 11
  • 56
  • 100
Orwell
  • 1,468
  • 1
  • 13
  • 28
  • 1
    There is a lot of potential problems with bruteforce parsing - like, macros, splitted lines, etc. Use a complete AST parser, i.e. libclang. – abyss.7 Nov 10 '13 at 16:06
  • This question is more about the general algorithm, not about parsing pitfalls. Thanks for the libclang suggestion though. – Orwell Nov 10 '13 at 17:04

2 Answers2

2

There's really no easy way to find the end of any valid template identifier list, because there are even more pathological possibilities than the example from bits/random.h. (There's a an example of a pathological case at the end of this answer: Is C++ context-free or context-sensitive? in which an identifier is a template or not depending on whether a largish integer is prime.)

The algorithm is easy to state (it's in paragraph 3 of §14.2 of the C++11 standard, [temp.names]). Basically, a < is a template bracket if and only if it follows a template identifier, the word template or one of the cast operators. Once you have an open <, it's matched by the first > which is not nested inside some parenthetic syntax (including braces and brackets) and which does not match some nested template bracket. In C++11, this includes > which are part of a >> token.

As an example, had the random.h example said bool = __w > ..., the > would have been considered a closing template bracket. But since it was a < and __w is not a template identifier, it's treated as a less-than operator. IMHO good style is always to wrap comparison and shift operators in parentheses if they're inside of template brackets, but that's just me.

The precise wording of the standard:

After name lookup finds that a name is a template-name or that an operator-function-id or a literal operator-id refers to a set of overloaded functions any member of which is a function template if this is followed by a <, the < is always taken as the delimiter of a template-argument-list and never as the less-than operator. When parsing a template-argument-list, the first non-nested > is taken as the ending delimiter rather than a greater-than operator.

The algorithm is hard to implement because there is no easy way to know what identifiers are template identifiers because you need to parse the entire program (at least, up to the point of use) to know what kind of thing each identifier is. And that includes finding and parsing included files, and otherwise preprocessing the source code.

If you really want it to be accurate, you need to use a C++ parser and be aware that it is potentially quite a slow process, since it can include template instantiation (as with the cited prime-check example). If you're just trying to do something like syntax colouring, you can probably get away with an approximation.

Community
  • 1
  • 1
rici
  • 234,347
  • 28
  • 237
  • 341
1

You would need to skip anything inside of () brackets.
You need to look at the identifier before the < and know if it's a type name or not.

This second one is the problematic one, you need to scan through all your source code and identify all the names of class/struct/unions and typedefs so that when you reach an expression of the form __w < __a (simplified for this example) you know whether or not __w names a type.
The reason this gets problematic is if you encounter any Preprocessor Metaprogramming (like Boost's library on the subject) you're basically stuck writing a program that can evaluate these to see what type names were created.
Furthermore, consider this piece of code (a little more complicated than necessary to show how hard it is):

    template <template <class> class T>
    struct Base
    {
        template <class X>
        using type = T<X>;
    };

    template <>
    struct Base<std::numeric_limits>//Let's assume numeric_limits only has one template argument for this example
    {
        static const int type = 0;
    };

    template <class T, Base<T>::type < 0>
    struct OtherClass{};

This kind of complexity is what the typename keyword is for, since Base<T> is a dependent scope, you can't tell right away whether Base<T>::type names a type or a member variable. In this way the grammar requires that Base<T>::type be a static member variable and typename Base<T>::type be a type (we don't know what kind of type, but that's okay we're only trying to find the end of the template identifier list).
Further to the example above you also need to handle the lesser known use of the template keyword. Since Base<T> is a dependent scope, how do you parse Base<T>::type<0>? This depends on what type is. The grammar, in this case, requires that Base<T>::type be a member variable and be parsed as (Base<T>::type < 0) >, and Base<T>::template type<0> is a template expression.

The keywords typename and template, while difficult to wrap your mind around until you understand dependent scopes, make your job easier for you in the end. Otherwise, the simple question of where does a template identifier list end becomes nigh-impossible to answer without writing a full compiler (and even then the compiler is much more difficult to write).

Dependent scopes aside, you still need to be able to handle non-dependent scopes. This means you need to know whether Base<numeric_limits>::type is a type or not, which means scanning every struct/class/union for typedefs and resolving public inheritance for base class typedefs and private inheritance + using statements.

If you restrict yourself to code that compiles, your job remains tractable, otherwise you have a lot more work ahead of you.

I don't promise this is everything, only that this will in all likelihood keep you busy for a while.

SirGuy
  • 10,660
  • 2
  • 36
  • 66