libclang: how to get token semantics

Question

libclang defines only 5 types of tokens:

CXToken_Punctuation
CXToken_Keyword
CXToken_Identifier
CXToken_Literal
CXToken_Comment

Is it possible to get more detailed information about tokens? For example, for the following source code:

struct Type;
void foo(Type param);

I would expect the output to be like:

struct - keyword
Type - type name
; - punctuation
void - type/keyword
foo - function name
( - punctuation
Type - type of the function parameter
param - function parameter name
) - punctuation
; - punctuation

I also need to map those entities to file locations.

score 7 · Accepted Answer · answered Apr 23 '16 at 21:40

First, you probably need a bit of background on how parsing works; a textbook on compilers would be a useful resource. The file is first converted into a series of tokens, which gives you identifiers, punctuation, etc. The code that does this is called a lexer. Then the parser runs; it converts the list of tokens into an AST (structured declarations, expressions, etc.).
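To illustrate why a token stream alone cannot say "function name" or "parameter type", here is a toy lexer in Python. It is only a sketch and not how clang's lexer is implemented: the `KEYWORDS` set, the `TOKEN_RE` pattern, and the `lex` function are all made up for this example and cover just enough of the language to handle the snippet from the question.

```python
import re

# Tiny illustrative keyword subset (an assumption; real C++ has many more).
KEYWORDS = {"struct", "void", "int", "class", "return"}

# One alternative per token kind; whitespace between tokens is simply skipped.
TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<literal>\d+|"(?:\\.|[^"\\])*")
  | (?P<word>[A-Za-z_]\w*)
  | (?P<punctuation>[^\w\s])
''', re.VERBOSE | re.DOTALL)


def lex(source):
    """Return a list of (kind, spelling) pairs for the given source text."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "word":
            # The lexer can only tell keywords from other words by lookup.
            kind = "keyword" if text in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens


for kind, text in lex("struct Type;\nvoid foo(Type param);"):
    print(text, "-", kind)
```

Note that `Type`, `foo`, and `param` all come out as plain identifiers. Telling them apart requires knowing which declaration each one belongs to, and that knowledge only exists after parsing.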

clang does keep track of the various parts of declarations and expressions, but not in the way you're describing. For a given function declaration, it keeps track of things like the location of the name of the function and the start of the parameter list, but it keeps those in terms of locations in the file, not tokens.

A CXToken is just a token; there isn't any additional associated semantic information beyond the five types you listed. (You can get the actual text of the token with clang_getTokenSpelling, and the location with clang_getTokenExtent.) clang_annotateTokens gives you CXCursors, which let you examine the relevant declarations.
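As a rough sketch of the token-to-cursor mapping, the Python bindings expose a `cursor` property on each token, which goes through clang_annotateTokens. The helper name `annotate` and the `main` wrapper below are made up for this example, and the cursor kinds you get back depend on your libclang version, so no output is shown.

```python
def annotate(tokens):
    """Yield (spelling, token kind, cursor kind, line, column) for each token."""
    for tok in tokens:
        loc = tok.location
        yield (tok.spelling, tok.kind.name, tok.cursor.kind.name,
               loc.line, loc.column)


def main():
    # Imported here so the helper above can be used without libclang installed.
    import clang.cindex

    source = "struct Type;\nvoid foo(Type param);\n"
    index = clang.cindex.Index.create()
    tu = index.parse("tmp.cpp", args=["-std=c++11"],
                     unsaved_files=[("tmp.cpp", source)])
    for row in annotate(tu.get_tokens(extent=tu.cursor.extent)):
        print(*row)
```

Calling `main()` prints one line per token. The cursor kind is the closest thing libclang offers to "what this token means"; from the cursor you can then inspect the declaration it refers to.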

Note that some details aren't exposed by the libclang API; if you need more detail, you might need to use clang's C++ API instead.

answered Apr 23 '16 at 21:40

Eli Friedman


Some links on where you base your observations on how Clang works will be very useful (not claiming you are wrong, in fact I have seen most of what you say while exploring libclang myself, merely pointing out how educational it would be having some links to support your remarks). – Yannis Apr 24 '16 at 20:17
I know how parsers work, I have implemented more than one. I'll add more details to my question later. Currently I'm using `clang_annotateTokens` but it returns cursors with unexpected `CXCursorKind`s. – piotrekg2 Apr 25 '16 at 09:26
@piotrekg2: If you truly know how parsers work, why do you expect the *token stream* to have non-token information like "function name" and so forth in it? That's parser-based information, not token-based. – Nicol Bolas Apr 27 '16 at 14:38
@NicolBolas Maybe I didn't express myself clearly enough. What I wanted to achieve was a sequence of named entities. For me such an entity should be the lowest node from the AST containing the given token. Currently I'm using `clang_annotateTokens` to map tokens to cursors. However it looks like its implementation has some bugs. Please see http://lists.llvm.org/pipermail/cfe-dev/2012-May/021739.html . I think I've found more than those few bugs from the link. – piotrekg2 Apr 27 '16 at 14:53

score 2 · Answer 2 · answered Apr 24 '16 at 10:52

You're looking for the token spelling and location attributes exposed by libclang. In the C API these are retrieved with clang_getTokenLocation and clang_getTokenSpelling. A minimal use of these functions (through their Python equivalents) would be:

import clang.cindex

s = '''
struct Type;
void foo(Type param);
'''

idx = clang.cindex.Index.create()
tu = idx.parse('tmp.cpp', args=['-std=c++11'], unsaved_files=[('tmp.cpp', s)], options=0)
# Print each token's kind, spelling, and source location.
for t in tu.get_tokens(extent=tu.cursor.extent):
    print(t.kind, t.spelling, t.location)

Gives:

TokenKind.KEYWORD struct <SourceLocation file 'tmp.cpp', line 2, column 1>
TokenKind.IDENTIFIER Type <SourceLocation file 'tmp.cpp', line 2, column 8>
TokenKind.PUNCTUATION ; <SourceLocation file 'tmp.cpp', line 2, column 12>
TokenKind.KEYWORD void <SourceLocation file 'tmp.cpp', line 3, column 1>
TokenKind.IDENTIFIER foo <SourceLocation file 'tmp.cpp', line 3, column 6>
TokenKind.PUNCTUATION ( <SourceLocation file 'tmp.cpp', line 3, column 9>
TokenKind.IDENTIFIER Type <SourceLocation file 'tmp.cpp', line 3, column 10>
TokenKind.IDENTIFIER param <SourceLocation file 'tmp.cpp', line 3, column 15>
TokenKind.PUNCTUATION ) <SourceLocation file 'tmp.cpp', line 3, column 20>
TokenKind.PUNCTUATION ; <SourceLocation file 'tmp.cpp', line 3, column 21>
