Extracting class from demangled symbol

Question

I'm trying to extract the (full) class names from demangled symbol output of nm using boost::regex. This sample program

#include <vector>

namespace Ns1
{
namespace Ns2
{
    template<typename T, class Cont>
    class A
    {
    public:
        A() {}
        ~A() {}
        void foo(const Cont& c) {}
        void bar(const A<T,Cont>& x) {}

    private:
        Cont cont;
    };
}
}

int main()
{
    Ns1::Ns2::A<int,std::vector<int> > a;
    Ns1::Ns2::A<int,std::vector<int> > b;
    std::vector<int> v;

    a.foo(v);
    a.bar(b);
}

will produce the following symbols for class A

Ns1::Ns2::A<int, std::vector<int, std::allocator<int> > >::A()
Ns1::Ns2::A<int, std::vector<int, std::allocator<int> > >::bar(Ns1::Ns2::A<int, std::vector<int, std::allocator<int> > > const&)
Ns1::Ns2::A<int, std::vector<int, std::allocator<int> > >::foo(std::vector<int, std::allocator<int> > const&)
Ns1::Ns2::A<int, std::vector<int, std::allocator<int> > >::~A()

I want to extract the class (instance) name Ns1::Ns2::A<int, std::vector<int, std::allocator<int> > >, preferably using a single regular expression pattern, but I have trouble parsing the recursively occurring class specifiers within the <> pairs.

Does anyone know how to do this with a regular expression pattern (that's supported by boost::regex)?

My solution (based on David Hammen's answer, thus the accept):

I don't use (single) regular expressions to extract class and namespace symbols. I have created a simple function that strips off bracketing character pairs (e.g. <> or ()) from the tail of symbol strings:

std::string stripBracketPair(char openingBracket,char closingBracket,const std::string& symbol, std::string& strippedPart)
{
    std::string result = symbol;

    if(!result.empty() &&
       result[result.length() -1] == closingBracket)
    {
        size_t openPos = result.find_first_of(openingBracket);
        if(openPos != std::string::npos)
        {
            strippedPart = result.substr(openPos);
            result = result.substr(0,openPos);
        }
    }
    return result;
}
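
Note that find_first_of() returns the first opening bracket in the string, not the one that matches the trailing closing bracket. For a symbol such as a::B<int>::C<double> the function above returns a::B rather than a::B<int>::C. A minimal sketch of a variant that counts nesting depth from the end is shown below; the name stripTrailingBracketPair is hypothetical and not part of the original solution, and names such as operator() or operator-> are assumed not to occur.

```cpp
#include <string>

// Hypothetical variant of stripBracketPair(): walks backwards from the final
// closing bracket and counts nesting depth to find the matching opener.
std::string stripTrailingBracketPair(char openingBracket, char closingBracket,
                                     const std::string& symbol,
                                     std::string& strippedPart)
{
    strippedPart.clear();
    if (symbol.empty() || symbol[symbol.length() - 1] != closingBracket)
    {
        return symbol; // nothing to strip
    }

    int depth = 0;
    for (std::string::size_type i = symbol.length(); i-- > 0;)
    {
        if (symbol[i] == closingBracket)
        {
            ++depth; // one more pair to close while moving left
        }
        else if (symbol[i] == openingBracket && --depth == 0)
        {
            // Found the opener that matches the trailing closing bracket.
            strippedPart = symbol.substr(i);
            return symbol.substr(0, i);
        }
    }
    return symbol; // unbalanced input: leave it untouched
}
```

With this variant only the last bracket group is removed, so nested or repeated template argument lists earlier in the symbol stay intact.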

This is used in two other methods that extract the namespace / class from the symbol:

std::string extractNamespace(const std::string& symbol)
{
    std::string ns;
    std::string strippedPart;
    std::string cls = extractClass(symbol);
    if(!cls.empty())
    {
        cls = stripBracketPair('<','>',cls,strippedPart);
        std::vector<std::string> classPathParts;

        boost::split(classPathParts,cls,boost::is_any_of("::"),boost::token_compress_on);
        ns = buildNamespaceFromSymbolPath(classPathParts);
    }
    else
    {
        // Assume this symbol is a namespace global function/variable
        std::string globalSymbolName = stripBracketPair('(',')',symbol,strippedPart);
        globalSymbolName = stripBracketPair('<','>',globalSymbolName,strippedPart);
        std::vector<std::string> symbolPathParts;

        boost::split(symbolPathParts,globalSymbolName,boost::is_any_of("::"),boost::token_compress_on);
        ns = buildNamespaceFromSymbolPath(symbolPathParts);
        std::vector<std::string> wsSplitted;
        boost::split(wsSplitted,ns,boost::is_any_of(" \t"),boost::token_compress_on);
        if(wsSplitted.size() > 1)
        {
            ns = wsSplitted[wsSplitted.size() - 1];
        }
    }

    if(isClass(ns))
    {
        ns = "";
    }
    return ns;
}

std::string extractClass(const std::string& symbol)
{
    std::string cls;
    std::string strippedPart;
    std::string fullSymbol = symbol;
    boost::trim(fullSymbol);
    // Strip from the trimmed copy; passing 'symbol' here would discard the trim.
    fullSymbol = stripBracketPair('(',')',fullSymbol,strippedPart);
    fullSymbol = stripBracketPair('<','>',fullSymbol,strippedPart);

    size_t pos = fullSymbol.find_last_of(':');
    if(pos != std::string::npos)
    {
        --pos;
        cls = fullSymbol.substr(0,pos);
        std::string untemplatedClassName = stripBracketPair('<','>',cls,strippedPart);
        if(untemplatedClassName.find('<') == std::string::npos &&
        untemplatedClassName.find(' ') != std::string::npos)
        {
            cls = "";
        }
    }

    if(!cls.empty() && !isClass(cls))
    {
        cls = "";
    }
    return cls;
}

The buildNamespaceFromSymbolPath() method simply concatenates valid namespace parts:

std::string buildNamespaceFromSymbolPath(const std::vector<std::string>& symbolPathParts)
{
    if(symbolPathParts.size() >= 2)
    {
        std::ostringstream oss;
        bool firstItem = true;
        for(unsigned int i = 0;i < symbolPathParts.size() - 1;++i)
        {
            if((symbolPathParts[i].find('<') != std::string::npos) ||
               (symbolPathParts[i].find('(') != std::string::npos))
            {
                break;
            }
            if(!firstItem)
            {
                oss << "::";
            }
            else
            {
                firstItem = false;
            }
            oss << symbolPathParts[i];
        }
        return oss.str();
    }
    return "";
}

The isClass() method, at least, does use a regular expression: it scans all symbols for a constructor of the candidate class (which unfortunately doesn't work for classes that only contain member functions and have no constructor symbol):

std::set<std::string> allClasses;

bool isClass(const std::string& classSymbol)
{
    std::set<std::string>::iterator foundClass = allClasses.find(classSymbol);
    if(foundClass != allClasses.end())
    {
        return true;
    }

    std::string strippedPart;
    std::string constructorName = stripBracketPair('<','>',classSymbol,strippedPart);
    std::vector<std::string> constructorPathParts;

    boost::split(constructorPathParts,constructorName,boost::is_any_of("::"),boost::token_compress_on);
    if(constructorPathParts.size() > 1)
    {
        constructorName = constructorPathParts.back();
    }
    boost::replace_all(constructorName,"(","[\\(]");
    boost::replace_all(constructorName,")","[\\)]");
    boost::replace_all(constructorName,"*","[\\*]");

    std::ostringstream constructorPattern;
    std::string symbolPattern = classSymbol;
    boost::replace_all(symbolPattern,"(","[\\(]");
    boost::replace_all(symbolPattern,")","[\\)]");
    boost::replace_all(symbolPattern,"*","[\\*]");
    constructorPattern << "^" << symbolPattern << "::" << constructorName << "[\\(].+$";
    boost::regex reConstructor(constructorPattern.str());

    for(std::vector<NmRecord>::iterator it = allRecords.begin();
        it != allRecords.end();
        ++it)
    {
        if(boost::regex_match(it->symbolName,reConstructor))
        {
            allClasses.insert(classSymbol);
            return true;
        }
    }
    return false;
}

As mentioned, the last method doesn't reliably find a class name if the class doesn't provide any constructor, and it is quite slow on big symbol tables. But at least this seems to cover what you can get out of nm's symbol information.

I have left the regex tag on the question so that other users may find out that regex is not the right approach.

Doesn't `nm` come with a `--demangle` option? Why reinvent the round peg? — Kerrek SB, Sep 16 '12 at 12:33
@KerrekSB I'm already using the demangled symbols, I want to extract the class names from them. — πάντα ῥεῖ, Sep 16 '12 at 12:35
Oh OK. Looks like the template syntax does not describe a regular language, though. It's more like XML (and [we all know where that is going](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)). — Kerrek SB, Sep 16 '12 at 12:40
@KerrekSB , Re "Doesn't nm come with a --demangle option?" : Not necessarily; on some systems you have to pipe the output from `nm` to `c++filt`. In any case, it looks like g-makulik is trying to parse class names from those demangled function names produced by `nm` (or `nm | c++filt`). — David Hammen, Sep 16 '12 at 12:43
@KerrekSB I've read something about recursively matching patterns as an extension of the REE, but I'm not sure how to use these or whether they're supported by `boost::regex`. — πάντα ῥεῖ, Sep 16 '12 at 12:43

score 2 · Accepted Answer · answered Sep 16 '12 at 13:17

This is hard to do even with Perl's extended regular expressions, which are considerably more powerful than anything in C++. I suggest a different tack:

First get rid of the things that don't look like functions, such as data (look for the D designator). Stuff like virtual thunk to this, virtual table for that, etc., will also get in your way; get rid of them before you do the main parsing. This filtering is something where a regexp can help. What you should have left are functions. For each function,

1. Get rid of the stuff after the final closing parenthesis. For example, Foo::Bar(int,double) const becomes Foo::Bar(int,double).
2. Strip the function arguments. The problem here is that you can have parentheses inside the parentheses, e.g., functions that take function pointers as arguments, which might in turn take function pointers as arguments. Don't use a regexp. Use the fact that parentheses match. After this step, Foo::Bar(int,double) becomes Foo::Bar while a::b::Baz<lots<of<template>, stuff>>::Baz(int, void (*)(int, void (*)(int))) becomes a::b::Baz<lots<of<template>, stuff>>::Baz.
3. Now work on the front end. Use a similar scheme to parse through that template stuff. With this, that messy a::b::Baz<lots<of<template>, stuff>>::Baz becomes a::b::Baz::Baz.
4. At this stage, your functions will look like a::b:: ... ::ClassName::function_name. There is a slight problem here with free functions in some namespace. Destructors are a dead giveaway of a class; there's no doubt that you have a class name if the function name starts with a tilde. Constructors are a near giveaway that you have a class at hand -- so long as you don't have a namespace Foo in which you have defined a function Foo.
5. Finally, you may want to re-insert the template stuff you cut out.
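
The first three steps above (dropping trailing qualifiers, stripping the argument list, removing template arguments) can be sketched with plain bracket counting. This is only an illustration under a few assumptions: the function names stripArguments and stripTemplateArguments are made up for this example, the input is a demangled symbol without a leading return type, and operator names containing <, > or () are not handled.

```cpp
#include <string>

// Steps 1 and 2: cut everything from the opening parenthesis that matches the
// last ')' in the symbol. Trailing qualifiers such as " const" go with it.
std::string stripArguments(const std::string& symbol)
{
    const std::string::size_type close = symbol.rfind(')');
    if (close == std::string::npos)
    {
        return symbol; // no argument list, e.g. a data symbol
    }
    int depth = 0;
    for (std::string::size_type i = close + 1; i-- > 0;)
    {
        if (symbol[i] == ')')
        {
            ++depth;
        }
        else if (symbol[i] == '(' && --depth == 0)
        {
            return symbol.substr(0, i); // matching opener found
        }
    }
    return symbol; // unbalanced parentheses: give up
}

// Step 3: drop every <...> group, however deeply nested.
std::string stripTemplateArguments(const std::string& name)
{
    std::string result;
    int depth = 0;
    for (std::string::size_type i = 0; i < name.length(); ++i)
    {
        if (name[i] == '<')
        {
            ++depth;
        }
        else if (name[i] == '>' && depth > 0)
        {
            --depth;
        }
        else if (depth == 0)
        {
            result += name[i]; // only characters outside <> survive
        }
    }
    return result;
}
```

Applying stripArguments() first and stripTemplateArguments() second reproduces the transformations quoted in the steps; the removed template text would have to be kept separately if it is to be re-inserted later.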

Thanks for your answer, I'm already following this direction now but still using RE. I've tried to avoid writing a parser for the `()` `<>` pair matching stuff, but it really seems to be the better approach instead of RE. — πάντα ῥεῖ, Sep 16 '12 at 13:33
It really isn't. Look at the perl module Text::Balanced. It has the full power of perl regular expressions right at hand, and yet it still uses a counting mechanism. — David Hammen, Sep 16 '12 at 16:14
Thanks for the hint, David. It seems that simply stripping bracketing character pairs is much more promising for the analysis. I'll post the solution as soon as I'm satisfied with the results. I'm also trying to extract the namespaces, so I at least have to consider looking up a class constructor method to distinguish (nesting) classes from namespaces. — πάντα ῥεῖ, Sep 16 '12 at 16:30
@g-makulik - It looks like you are trying to build a code analysis tool without having to build a parser for C++. (Do that and you are getting close to having a compiler. Big chore for C++!) As an alternative to your `nm` analysis, you might want to look at using the clang libraries. With this you do have a parser. You can use parts of clang to do the lexing and parsing and use the API to query what the lexer/parser found. — David Hammen, Sep 17 '12 at 11:38
Not really, I just want to 'analyze' the nm output to be able to group size summaries for namespaces and classes. This should be helpful for optimizing on small targets (finding out where most of the code/data resides). Looking up the class destructor to distinguish classes from namespaces does seem to be the better choice, but I have come across cases (I guess when a class/struct contains neither data nor a vtable) where no default constructor/destructor is emitted at all. I had this for pure functor classes, for example. — πάντα ῥεῖ, Sep 17 '12 at 13:14

PiotrNycz · Answer 2 · 2012-09-16T17:59:07.907

1

I did the extraction with a simple C++ function.

See link for the full code; the idea behind it is:

There are basic-level tokens separated by ::.
If there are N basic-level tokens, the first N-1 describe the class name; the last is the function.
We go up one level (+1) on ( or <.
On a closing ) or > we go down one level (-1).
Basic level means, of course, level == 0.

I have a strong feeling that this cannot be done with regular expressions, since there can be unlimited levels of brackets. My function allows 256 levels; it can be switched to std::stack<char> for unlimited levels.

The function:

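#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Splits a demangled name at every "::" that is not nested inside () or <>.
std::vector<std::string> parseCppName(const std::string& line)
{
   std::vector<std::string> retVal;
   const int maxLevel = 256;
   int level = 0;               // current bracket nesting depth
   char closeChars[maxLevel];   // closing bracket expected at each open level

   std::size_t startPart = 0;
   for (std::size_t i = 0; i < line.length(); ++i)
   {
      if (line[i] == ':' && level == 0)
      {
          if (i + 1 >= line.length() || line[i + 1] != ':')
             throw std::runtime_error("missing :");
          retVal.push_back(line.substr(startPart, i - startPart));
          startPart = ++i + 1;   // skip the second ':'
      }
      else if (line[i] == '(' || line[i] == '<') {
         // Guard against writing past the end of closeChars.
         if (level >= maxLevel)
            throw std::runtime_error("Nesting too deep");
         closeChars[level++] = (line[i] == '(') ? ')' : '>';
      }
      else if (level > 0 && line[i] == closeChars[level - 1]) {
         --level;
      }
      else if (line[i] == '>' || line[i] == ')') {
         throw std::runtime_error("Extra )>");
      }
   }
   if (level > 0)
       throw std::runtime_error("Missing )>");
   retVal.push_back(line.substr(startPart));
   return retVal;
}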
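
As a sketch of the unlimited-depth variant mentioned above, the fixed array can be replaced by a growable container (std::vector<char> is used as the stack here instead of std::stack<char>). The names splitTopLevel and qualifierOf are hypothetical and only serve this example; qualifierOf joins the first N-1 parts, which is the class name whenever the symbol is a member function.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical variant of parseCppName(): same level counting, but the
// expected closing brackets live in a growable std::vector<char>.
std::vector<std::string> splitTopLevel(const std::string& line)
{
    std::vector<std::string> parts;
    std::vector<char> closers; // closers.size() is the current level
    std::string::size_type start = 0;

    for (std::string::size_type i = 0; i < line.length(); ++i)
    {
        const char c = line[i];
        if (c == ':' && closers.empty())
        {
            if (i + 1 >= line.length() || line[i + 1] != ':')
                throw std::runtime_error("single ':' at top level");
            parts.push_back(line.substr(start, i - start));
            start = i + 2;
            ++i; // skip the second ':'
        }
        else if (c == '(') closers.push_back(')');
        else if (c == '<') closers.push_back('>');
        else if (!closers.empty() && c == closers.back()) closers.pop_back();
        else if (c == ')' || c == '>')
            throw std::runtime_error("unexpected closing bracket");
    }
    if (!closers.empty())
        throw std::runtime_error("missing closing bracket");
    parts.push_back(line.substr(start));
    return parts;
}

// Joins the first N-1 parts back together; empty if there is no qualifier.
std::string qualifierOf(const std::vector<std::string>& parts)
{
    std::string result;
    for (std::vector<std::string>::size_type i = 0; i + 1 < parts.size(); ++i)
    {
        if (i > 0) result += "::";
        result += parts[i];
    }
    return result;
}
```

For the foo() symbol from the question, splitTopLevel() yields four parts (Ns1, Ns2, the templated A and the function with its arguments), and qualifierOf() returns the full class name including its template arguments. Whether that qualifier is really a class or just a namespace still has to be decided separately.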

edited Sep 16 '12 at 17:59

answered Sep 16 '12 at 17:52

PiotrNycz


I agree with your feeling about using regular expressions to accomplish this. I've developed a parser based on @David Hammen's hints so far. Maybe I'll come back to your proposal if I see a need to improve my current solution. Also note my comment about the namespace extraction use case. – πάντα ῥεῖ Sep 16 '12 at 19:17
I think it will be very hard to distinguish between nested classes and namespaces without remembering all the lines. After parsing all lines, every set of N-1 parts (as given by my function) names a class. The others are namespaces. But this will be broken by empty classes, I mean classes without functions, c-tors and d-tors. – PiotrNycz Sep 16 '12 at 19:35
Actually I do store all the input lines and look up a constructor symbol to find 'real' classes within these. That will not cover any classes which only contain static functions and have no (not even a default) constructor. But this will still be fine for my purposes. – πάντα ῥεῖ Sep 16 '12 at 20:24
This could break your algorithm: `namespace foo { void foo() {} }` – PiotrNycz Sep 16 '12 at 20:32
I only look up constructor functions where `foo` is supposed to be a class, thus the constructor lookup pattern is `foo::foo::foo(` for this case. – πάντα ῥεῖ Sep 16 '12 at 20:41
@g-makulik - You might be better off looking for destructors because of the `namespace foo { void foo(); }` problem. There is no confusion whether `Bar` is a class or a namespace when you see `foo::Bar::~Bar()`. That tilde is a dead giveaway because you can't use a tilde in a function name (except destructors, of course). – David Hammen Sep 17 '12 at 11:30
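
The destructor check suggested in the last comment could be sketched as a single pass over all demangled symbols, which also avoids building one regular expression per candidate class. The function names lastTopLevelScope and collectClassesByDestructor are made up for this illustration, and the input is assumed to be a plain list of demangled symbol names. As noted in the comments, classes for which no destructor symbol is emitted will be missed.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the position of the last "::" that is not nested inside <> or (),
// or npos if there is none.
std::string::size_type lastTopLevelScope(const std::string& symbol)
{
    std::string::size_type last = std::string::npos;
    int depth = 0;
    for (std::string::size_type i = 0; i < symbol.length(); ++i)
    {
        const char c = symbol[i];
        if (c == '<' || c == '(')
        {
            ++depth;
        }
        else if (c == '>' || c == ')')
        {
            --depth;
        }
        else if (c == ':' && depth == 0 &&
                 i + 1 < symbol.length() && symbol[i + 1] == ':')
        {
            last = i;
            ++i; // skip the second ':'
        }
    }
    return last;
}

// One pass over all demangled symbols: whatever qualifies a member whose name
// starts with '~' must be a class.
std::set<std::string> collectClassesByDestructor(const std::vector<std::string>& symbols)
{
    std::set<std::string> classes;
    for (std::vector<std::string>::const_iterator it = symbols.begin();
         it != symbols.end(); ++it)
    {
        const std::string::size_type pos = lastTopLevelScope(*it);
        if (pos != std::string::npos && pos + 2 < it->length() &&
            (*it)[pos + 2] == '~')
        {
            classes.insert(it->substr(0, pos));
        }
    }
    return classes;
}
```

The resulting set can then be queried with find() instead of rescanning all records for every candidate, and a symbol such as foo::foo() from `namespace foo { void foo() {} }` is never mistaken for a constructor.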
