I am writing a utility which is supposed to parse C++ (and C) header files, extract the structs, enums, fields etc. and generate code in other languages based on the extracted information. I decided to use libclang for this.

I'm using a RecursiveASTVisitor and it seems I'm able to extract all the information I need, except for comments.

I want to read the comment that appears right above every declaration (field, struct, class, enum) and include its text when I generate the code in other languages.

The problem is that all the samples I saw that deal with comments use CXCursor and the C interface for clang, and I have no idea how to get a CXCursor in my context.

So - how can I extract comments while still using RecursiveASTVisitor?

Asaf
  • You could study the source code of clang-fmt... – Kerrek SB Aug 12 '14 at 22:41
  • You mean you are writing another Doxygen? ;) [Yad, Yet Another Doxygen - or perhaps "Yet Other Doxygen Again", Yoda] – Mats Petersson Aug 12 '14 at 22:48
  • Perhaps Bison/Flex is the better start point to write tokenizer/parser? – Tanuki Aug 13 '14 at 04:05
  • @MatsPetersson - I don't want a separate documentation. I want to embed the relevant comment for each field/struct in the generated code (which will be in other languages - C#, Lua etc.) – Asaf Aug 13 '14 at 08:32
  • @Tanuki - I don't really know these, but from some googling it looks like they're non-c++ parsing specific, and that there's no canonical c++ parser implementation using them. The big advantage of libclang is that it actually *compiles* the code, so I get, for example, the byte sizes of the structs/fields/enums, or even bit sizes when I use bitfields. I didn't mention that the purpose is to be able to send and receive these data structures over the network, so simple parsing will not help here. – Asaf Aug 13 '14 at 08:41
  • @KerrekSB - On it. Will get back to you next year. :) – Asaf Aug 13 '14 at 08:41
  • I'm using Flex/Bison in my projects and it generates C++ code. For the serialization you'd better have a look into boost::serialization because you need to convert data for the network transfers. Also, did you see this: http://clang.llvm.org/doxygen/group__CINDEX.html – Tanuki Aug 13 '14 at 08:44
  • @Asaf: clang-fmt took a lot longer than a year to develop, so you're making good time. ("Let me just quickly parse some C++...") – Kerrek SB Aug 13 '14 at 09:05
  • @Tanuki - I'm trying to parse and get extra data (such as the field sizes) from C++, not only generate C++ code. Since the various communication endpoints include an embedded system, a C# utility and a lua script (wireshark plugin), boost:serialization is not really an option. A simple #pragma pack (since the endianness of all systems is the same) already does the trick. As for the link - thanks, but yes, I looked into this. I just don't know how to extract the CxCursor from the framework I'm using, as I said in the question. – Asaf Aug 13 '14 at 09:21
  • Unpack libclang, brew some tea and dive into the sources. :D – Tanuki Aug 13 '14 at 09:24
  • @KerrekSB studying the source code took a little less time than expected :) See my answer – Asaf Aug 13 '14 at 19:08
  • @Asaf: Hurrah - I think one of the main reasons for developing Clang was that GCC did not offer any road towards making code analysis like this possible... – Kerrek SB Aug 13 '14 at 20:34

2 Answers

After some more digging, I found this:

For any relevant visited Decl (VisitXXXDecl), I can do the following:

virtual bool VisitDecl(Decl* d)
{
    ASTContext& ctx = d->getASTContext();
    SourceManager& sm = ctx.getSourceManager();

    // Returns nullptr when no documentation comment is attached to d.
    const RawComment* rc = ctx.getRawCommentForDeclNoCache(d);
    if (rc)
    {
        // Found a comment attached to this declaration.
        SourceRange range = rc->getSourceRange();

        PresumedLoc startPos = sm.getPresumedLoc(range.getBegin());
        PresumedLoc endPos = sm.getPresumedLoc(range.getEnd());

        // getRawText returns a StringRef; .str() copies it into a std::string.
        std::string raw = rc->getRawText(sm).str();
        std::string brief = rc->getBriefText(ctx);

        // ... Do something with positions or comments
    }

    // ...
    return true; // keep traversing; returning false stops the visitor
}

Note that, as far as I could see, this only identifies comments that are on the line(s) directly above, and adjacent to, the current declaration in the code, and that are in one of the following formats:

  • /// Comment
  • /** Comment */
  • //! Comment

For example, in the following case:

/// A field with a long long comment
/// A two-liner
long long LongLongData;

raw will be:

/// A field with a long long comment
    /// A two-liner

And brief will be:

A field with a long long comment A two-liner

Either way, it's good enough for my needs.
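
To embed the raw text in generated code for another language, the comment markers usually have to be removed first. The following is only a minimal sketch: stripCommentMarkers and trim are hypothetical helpers (not part of Clang), and they assume the raw text uses only the three formats listed above, without leading "*" decorations on block-comment lines.

```cpp
#include <cstddef>
#include <initializer_list>
#include <sstream>
#include <string>
#include <vector>

// Trim spaces, tabs and carriage returns from both ends of s.
static std::string trim(const std::string& s)
{
    const char* ws = " \t\r";
    std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return "";
    std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper (not a Clang API): takes text shaped like the
// result of RawComment::getRawText() and returns the comment lines
// without their "///", "//!" or "/** ... */" markers.
std::vector<std::string> stripCommentMarkers(const std::string& raw)
{
    std::vector<std::string> result;
    std::istringstream in(raw);
    std::string line;
    while (std::getline(in, line))
    {
        line = trim(line);

        // Remove a leading "///", "//!" or "/**" marker, if present.
        for (const char* marker : {"///", "//!", "/**"})
        {
            if (line.compare(0, 3, marker) == 0)
            {
                line.erase(0, 3);
                break;
            }
        }

        // Remove a trailing "*/" that closes a block comment.
        if (line.size() >= 2 && line.compare(line.size() - 2, 2, "*/") == 0)
            line.erase(line.size() - 2);

        // Keep only lines that still carry text.
        line = trim(line);
        if (!line.empty())
            result.push_back(line);
    }
    return result;
}
```

Calling stripCommentMarkers(raw) on the two-line example above should give two entries, "A field with a long long comment" and "A two-liner", which can then be re-emitted with the target language's own comment syntax.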

Asaf

The above answer is perfect. However, to make getRawCommentForDeclNoCache return ordinary comments such as // or /* */, you need to pass the option "-fparse-all-comments" when invoking clang, because by default clang parses only Doxygen-style comments.
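
If the tool builds its own compiler argument list, the flag can be appended there. This is just a sketch under that assumption: withParseAllComments is a hypothetical helper name, and how the resulting vector reaches clang (for example via clang::tooling::runToolOnCodeWithArgs) depends on how the tool is set up.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not a Clang API): returns a copy of the compiler
// arguments with "-fparse-all-comments" appended, unless the flag is
// already present.
std::vector<std::string> withParseAllComments(std::vector<std::string> args)
{
    const std::string flag = "-fparse-all-comments";
    if (std::find(args.begin(), args.end(), flag) == args.end())
        args.push_back(flag);
    return args;
}
```

For instance, withParseAllComments({"-std=c++11"}) yields the two arguments "-std=c++11" and "-fparse-all-comments"; calling it again on that result leaves the list unchanged.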

Hemant