2

I am struggling to find a Python library or script to tokenize source code (that is, to find specific tokens such as function definition names, variable names, keywords, etc.).

I have managed to find keywords, whitespace, etc. using something like this, but function/class definition names have proved quite a challenge. I was hoping to use an existing script; I explored Pygments with no success. Its lexer seems ideal for what I want, but I have no idea how to use it from Python and also get the position of each token found.

For example, I would like to do something like this:

int fac(int n)
{
    return (n>1) ? n*fac(n-1) : 1;
}

From the source code above, I would like to get:

function_name: 'fac' at position (x, y)
variable_name: 'n' at position (x, y+8)

EDITED: Any suggestions will be appreciated, since I am in the dark here regarding tokenization and parsing of C++.

  • Are you talking about "function/class definitions" as in recognizing their syntax? If so, that's fundamentally a problem not suited for a tokenizer, and you need something that can handle context-free grammars, i.e., a parser – en_Knight Apr 22 '16 at 20:00
  • Possible duplicate of [Tokenizer with Pygments in Python](http://stackoverflow.com/questions/36801263/tokenizer-with-pygments-in-python). You asked this question a few minutes ago! – ChrisP Apr 22 '16 at 20:00
  • @ChrisP I tried to expand it and differentiate it from my previous question in the sense that now I am detailing on another and more generic (perhaps) route. –  Apr 22 '16 at 20:02
  • Questions asking people to recommend a tool are off-topic. – ChrisP Apr 22 '16 at 20:03
  • @en_Knight I should have made it clearer, I will edit my question. To simply answer your point, no I do not want the syntax but merely to extract the name of it and -obviously- to identify that it is a function being defined at that line. –  Apr 22 '16 at 20:03
  • @ChrisP off-topic of what? I am asking for help and guidance. How is this off-topic, please explain. –  Apr 22 '16 at 20:04
  • @nk-fford that's what I was afraid of :) There's no way to know that it's a function being defined at that line with a scanner. That's heavily contextually dependent, and you need a parser. Google "parser generator", "state parser", or "top down" parser, then add python after :) – en_Knight Apr 22 '16 at 20:05
  • @nk-fford: See #4 on [What topics can I ask about here?](http://stackoverflow.com/help/on-topic): _"Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow..."_ – ChrisP Apr 22 '16 at 20:07
  • @ChrisP I think this is salvageable, that's why I tried to answer; he's phrased it as if he's asking about a tool, which would be off topic, but I think he's really asking about why he's failing to tokenize in context, which is more on topic since it's directly related to compiler theory – en_Knight Apr 22 '16 at 20:10
  • @ChrisP Thank you for the link, I was not aware of that. The essence of my question was not about a specific tool though but I was in the 'dark' for my query. However I do understand the downvote. Thanks –  Apr 22 '16 at 20:15
  • @nk-fford I would change that last sentence - I'm not going to make the edit for you, but I think your phrasing asking "Does anyone know a tool/script that could do that" is a red flag for a lot of people. – en_Knight Apr 22 '16 at 20:18
  • @en_Knight I understand. Did it, thanks. ChrisP had a good point. However, thanks to your answer/discussion below I now have a better understanding of what I am looking for, so no regrets. Ready for my next question on the subject.. :) –  Apr 22 '16 at 20:21
  • Great, feel free to upvote or accept if any of the answers helped resolve your problem (you certainly don't have to, but you are able to); be sure to take a look at Austin's answer, it looks like it'll be helpful – en_Knight Apr 22 '16 at 20:23

2 Answers

2

Eli Bendersky is a smart guy, and sometimes active here on SO. He's got a blog post on this issue which I'll refer you directly to: Parsing C++ in Python with Clang.

Because things disappear, here's the takeaway:

Eli Bendersky wrote a C language (not C++) parser in Python, called pycparser. People keep asking him if he's going to add support for C++. He is not. He recommends instead that people use the Python bindings for libclang to get access to "a C API that the Clang team vows to keep relatively stable, allowing the user to examine parsed code at the level of an abstract syntax tree (AST)".

You can find the bindings separately on PyPI here. Note, though, that you'll need to have clang installed, so you may just want to point your PYTHONPATH directly at the bindings in the install location.
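As a rough illustration of what the libclang route looks like for the `fac` example in the question, here is a minimal sketch. It assumes the `clang.cindex` Python bindings and a matching libclang shared library are installed; the helper name `find_definitions`, the in-memory file name `example.cpp`, and the `-std=c++11` flag are illustrative choices, not something the blog post prescribes.

```python
C_SOURCE = "int fac(int n)\n{\n    return (n>1) ? n*fac(n-1) : 1;\n}\n"


def find_definitions(source, filename="example.cpp"):
    """Return (label, name, line, column) for declarations found by libclang.

    `filename` is a hypothetical name for the in-memory source; nothing is
    read from disk because the text is passed via `unsaved_files`.
    """
    # Imported lazily so this module still loads where libclang is absent.
    from clang.cindex import CursorKind, Index

    wanted = {
        CursorKind.FUNCTION_DECL: "function_name",
        CursorKind.PARM_DECL: "variable_name",
        CursorKind.VAR_DECL: "variable_name",
    }

    index = Index.create()
    tu = index.parse(
        filename,
        args=["-std=c++11"],               # assumed flag, adjust as needed
        unsaved_files=[(filename, source)],
    )

    results = []
    for cursor in tu.cursor.walk_preorder():
        loc = cursor.location
        # Skip nodes that do not come from our own source text.
        if loc.file is None or loc.file.name != filename:
            continue
        if cursor.kind in wanted:
            results.append(
                (wanted[cursor.kind], cursor.spelling, loc.line, loc.column)
            )
    return results
```

Calling `find_definitions(C_SOURCE)` on a machine with libclang available should return one entry per declaration, each with its line and column. Because the cursor kinds come from a real parse, a definition and a later call of the same name are no longer indistinguishable.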

aghast
  • This is a good answer. I tried to address why OP wasn't able to do what he was trying to do, but this seems to have some more practical solution involved. +1 – en_Knight Apr 22 '16 at 20:21
1

You're struggling to find a Python library to do what you want because a tokenizer fundamentally cannot do it.

I have managed to find keywords, whitespaces etc. using something like this but I found it quite a challenge for function/class definition names etc

You mean like this:

foo = 3
def foo():pass

What is foo? All a tokenizer should (or can) tell you is that foo is an identifier. Its context tells you whether it's a variable or a function declaration. You need a parser to handle context-free grammars. Mathematically, the class of context-free languages is strictly larger than the class of regular languages, which is all a standard lexer can recognize.
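The `foo` example can be checked directly with Python's standard library, which ships both a lexer (`tokenize`) and a parser (`ast`) for Python itself. This is only a sketch of the distinction for Python source, not a solution for C++; the helper names `name_tokens` and `definitions` are made up for illustration.

```python
import ast
import io
import tokenize

SOURCE = "foo = 3\ndef foo():pass\n"


def name_tokens(source):
    """Return (text, (row, col)) for every NAME token the lexer emits."""
    readline = io.StringIO(source).readline
    return [
        (tok.string, tok.start)
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.NAME
    ]


def definitions(source):
    """Return (kind, name, (row, col)) using the parser's syntax tree."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # The position is that of the `def` statement, not of the name.
            found.append(("function", node.name, (node.lineno, node.col_offset)))
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    found.append(
                        ("variable", target.id, (target.lineno, target.col_offset))
                    )
    return found


print(name_tokens(SOURCE))
print(definitions(SOURCE))
```

The lexer reports both occurrences of `foo` as plain NAME tokens (the keywords `def` and `pass` are NAME tokens too), while the parser labels one as an assignment target and the other as a function definition.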

Try a parser: here's one in Python

Normally I'd try to provide you links here to distinguish between the topics, but this is too broad to provide a single good link to. If you're interested, start with any standard compiler text. Elsewhere on SE, we see this question pop up as a theoretical question and, in some form, as a famous question about HTML.

Once you realize that tokenizers are (usually) built (largely) on regular expressions, it becomes more obvious why your task is not going to end happily.
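To make that concrete, here is a small regex-based tokenizer run over the `fac` snippet from the question. It is a toy sketch: the token classes, the two-word `KEYWORDS` set, and the name `tokenize_c` are assumptions for illustration and cover only what this snippet needs, not real C++ lexing.

```python
import re

C_SOURCE = "int fac(int n)\n{\n    return (n>1) ? n*fac(n-1) : 1;\n}\n"

KEYWORDS = {"int", "return"}  # tiny illustrative subset, not the full list

TOKEN_RE = re.compile(
    r"""
    (?P<identifier>[A-Za-z_]\w*)
  | (?P<number>\d+)
  | (?P<punct>[{}()<>?:;*+\-=,])
  | (?P<newline>\n)
  | (?P<space>[ \t]+)
    """,
    re.VERBOSE,
)


def tokenize_c(source):
    """Return (kind, text, line, column) tuples with 1-based positions.

    Characters that match no pattern are silently skipped by finditer.
    """
    tokens = []
    line, line_start = 1, 0
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "newline":
            line += 1
            line_start = match.end()
        elif kind != "space":
            text = match.group()
            if kind == "identifier" and text in KEYWORDS:
                kind = "keyword"
            tokens.append((kind, text, line, match.start() - line_start + 1))
    return tokens


for token in tokenize_c(C_SOURCE):
    print(token)
```

Positions come for free, but the `fac` on line 1 (the definition) and the `fac` on line 3 (the recursive call) both come out as `identifier`. Nothing at the regular-expression level says which one is being defined; that information lives in the grammar.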


Now that you know the terminology, I think you'll find this SO article useful, which recommends gcc-ml. I don't know how up-to-date it is, but it's the type of program you're looking for.

en_Knight
  • Your points are really helpful; indeed I am searching in the dark because I was misunderstanding what I was looking for. So, are you aware of any parsers that handle C++ in the way I mention in my description? –  Apr 22 '16 at 20:13
  • Yes. Gcc is a good one :) All kidding aside, *any* parser can handle C++. I wouldn't try to roll your own - C++ is a very complex language. The one I cited is a good one, but again, take an existing C++ compiler and just use the parse tree. Most compilers I am familiar with allow you to dump this information without fully compiling – en_Knight Apr 22 '16 at 20:14
  • @nk-fford see my edit, I think it has the type of thing you're looking for – en_Knight Apr 22 '16 at 20:20