1

I would like to write a Boost Spirit Qi parser that can parse arbitrary C integer literals (e.g. 1234 or 0x1234ULL) and convert them to arbitrary precision llvm::APInt values.

I imagine that to do this I need to combine separate parsers for decimal, hexadecimal etc. literals.

Taking the latter as an example, the parser would need to recognize tokens 0xNN...NS where N are hexadecimal digits and S is a valid literal suffix.

Constructing such a parser is easy but I make it "discard" prefix and suffix and return the remaining digits converted to llvm::APInt values the same way that e.g. qi::uint_ return unsigned integers?

I know there is qi::uint_parser but that class seems very limited since it seems to build up its results from integers as opposed to strings. This is a staple feature of other parser generators so I'm surprised that the documentation glosses over this.

sehe
  • 374,641
  • 47
  • 450
  • 633
Peter
  • 2,919
  • 1
  • 16
  • 35
  • Spirit should be able to parse into unbounded custom integers out of the box (like `uint_parser`), if it does not -- please file a bug report. – Nikita Kniazev Apr 25 '21 at 17:33
  • @NikitaKniazev I do not think you looked at that type, or its documentation. It's not intended as a general drop-in compatible integral number type. I posted my answer nearly simultaneously with your comment – sehe Apr 26 '21 at 00:37
  • @sehe I did, and it says `APInt is a functional replacement for common case unsigned integer type ... APInt provides a variety of arithmetic operators and methods to manipulate integer values of any bit-width. It supports both the typical integer arithmetic and comparison operations as well as bitwise manipulation.` https://llvm.org/doxygen/classllvm_1_1APInt.html#details – Nikita Kniazev Apr 26 '21 at 13:22
  • @NikitaKniazev yeah, that's like a marketing blurb. Note they don't say with what interface they support thses operations. I like to picture myself using a type while reading through the interface: https://godbolt.org/z/o75vMe5cc – sehe Apr 26 '21 at 20:45
  • @sehe it does not say the type supports implicit conversions. I have not made Spirit to support such types yet. Also, there is a parsing constructor https://llvm.org/doxygen/classllvm_1_1APInt.html#a337b62553d6b0e5ce2868e086b589a00 which for whatever reason no one mentioned here. – Nikita Kniazev Apr 27 '21 at 13:50
  • @NikitaKniazev Nobody mentioned implicit conversions. `ApInt(0)` similarly cannot compile. And yes, there are **other** interfaces, but that's hardly relevant. Insofar they **can be** relevant, I did mention it: [_"Of course, with semantic actions you can in fact parse the string representation using the fromString factory method"_](https://stackoverflow.com/questions/67253104/parse-arbitrary-precision-numbers-with-boost-spirit?noredirect=1#answer-67255937:~:text=Of%20course%2C%20with%20semantic%20actions%20you,representation%20using%20the%20fromString%20factory%20method). – sehe Apr 27 '21 at 14:46

1 Answers1

1

I think that what is a staple of parser generators is indeed parsing into arbitrary types of integers.

What you are after is more: you want to parse into a type that represents arbitrary types of integers with added semantic information, based on decisions in your grammar.

These decisions cannot be baked into the parser generator, because that would tie it to a particular type of grammars.

Of course, you can do that, too. Let me walk through step by step.

1. The Staple

As you have noted, Spirit does that. Let's demonstrate the basics.

Loosely after http://www.nongnu.org/hcb/#integer-literal

_suffix += "u", "l", "ll", "ul", "lu", "ull", "llu";

_start = qi::no_case[ // case insensitive
    ("0x"          >> qi::uint_parser<Integer, 16>{} |
     "0b"          >> qi::uint_parser<Integer, 2>{} |
     &qi::lit('0') >> qi::uint_parser<Integer, 8>{} |
                      qi::uint_parser<Integer, 10>{})
    // ignored for now:
    >> -_suffix];

As you can see it parses hex, binary, octal and decimal unsigned numbers with an optional suffix. We're ignoring the suffix for now, so that we can demonstrate that it parses into generalized integral types.

See a demo Live On Compiler Explorer

template <typename Integer> void test() {
    std::cout << " ---- " << __PRETTY_FUNCTION__ << "\n";
    using It = std::string::const_iterator;
    IntLiteral<It, Integer> const parser {};

    for (std::string const input : {
             "1234",
             "1234u",
             "0x12f34ULL",
             "033ULL",
             "0b101011l",
             "33lu"
         }) {
        Integer value;
        if (parse(input.begin(), input.end(), parser >> qi::eoi, value)) {
            std::cout << "Parsed " << std::quoted(input) << " -> " << value << "\n";
        } else {
            std::cout << "Failed to parse " << std::quoted(input) << "\n";
        }
    }
}

int main() {
    test<std::uintmax_t>();
    test<boost::multiprecision::checked_int1024_t>();
}

Prints

 ---- void test() [with Integer = long unsigned int]
Parsed "1234" -> 1234
Parsed "1234u" -> 1234
Parsed "0x12f34ULL" -> 77620
Parsed "033ULL" -> 27
Parsed "0b101011l" -> 43
Parsed "33lu" -> 33
 ---- void test() [with Integer = boost::multiprecision::number<boost::multiprecision::backend
s::cpp_int_backend<1024, 1024, boost::multiprecision::signed_magnitude, boost::multiprecision:
:checked, void> >]
Parsed "1234" -> 1234
Parsed "1234u" -> 1234
Parsed "0x12f34ULL" -> 77620
Parsed "033ULL" -> 27
Parsed "0b101011l" -> 43
Parsed "33lu" -> 33

2. Variant Type

Now, you actually want the result to reflect the literal's type.

You can do that without LLVM. E.g. by parsing into intmax_t first, and then coercing to the appropriate type based on the suffix.

Let's parse into

using CxxInteger = boost::variant<
    signed, unsigned, 
    signed long, unsigned long,
    signed long long, unsigned long long>;

Then parsing with:

using Raw = std::uintmax_t;

_start = no_case [ // case insensitive
    ("0x"       >> uint_parser<Raw, 16>{} |
     "0b"       >> uint_parser<Raw,  2>{} |
     &lit('0')  >> uint_parser<Raw,  8>{} |
                   uint_parser<Raw, 10>{})
    // ignored for now:
    >> _optsuffix
] [ _val = coerce_type(_1, _2) ];

_optsuffix = no_case[_suffix] | attr(Suffix::signed_);

Now, we have to write the semantic rules that apply to our grammar:

struct converter_f {
    CxxInteger operator()(uintmax_t raw, Suffix sfx) const {
        switch (sfx) {
          case Suffix::signed_:   return static_cast<signed>(raw);
          case Suffix::unsigned_: return static_cast<unsigned>(raw);
          case Suffix::long_:     return static_cast<long>(raw);
          case Suffix::longlong_: return static_cast<long long>(raw);
          case Suffix::ul_:       return static_cast<unsigned long>(raw);
          case Suffix::ull_:      return static_cast<unsigned long long>(raw);
        }
        throw std::invalid_argument("sfx");
    }
};
boost::phoenix::function<converter_f> coerce_type;

That's it. We can now parse the same test cases Live On Compiler Explorer

std::cout << "Parsed " << std::quoted(input) << " -> " << value
          << " (type #" << value.which() << " "
          << boost::core::demangle(value.type().name()) << ")\n";

Prints

 ---- void test()
Parsed "1234" -> 1234 (type #0 int)
Parsed "1234u" -> 1234 (type #1 unsigned int)
Parsed "0x12f34ULL" -> 77620 (type #5 unsigned long long)
Parsed "033ULL" -> 27 (type #5 unsigned long long)
Parsed "0b101011l" -> 43 (type #2 long)
Parsed "33lu" -> 33 (type #3 unsigned long)

3. Applying To LLVM APInt

The mechanics are the same:

struct converter_f {
    template <typename T> static auto as(uint64_t raw) {
        return llvm::APInt(raw, CHAR_BIT * sizeof(T), std::is_signed_v<T>);
    }
    llvm::APInt operator()(uintmax_t raw, Suffix sfx) const {
        switch (sfx) {
        case Suffix::signed_:   return as<signed>(raw);
        case Suffix::unsigned_: return as<unsigned>(raw);
        case Suffix::long_:     return as<long>(raw);
        case Suffix::longlong_: return as<long long>(raw);
        case Suffix::ul_:       return as<unsigned long>(raw);
        case Suffix::ull_:      return as<unsigned long long>(raw);
        }
        throw std::invalid_argument("sfx");
    }
};

Full Demo

"Live" On Compiler Explorer

(compiler explorer doesn't support linking to LLVM)

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iomanip>
#include <llvm/ADT/APInt.h>
namespace qi = boost::spirit::qi;

template <typename It>
struct IntLiteral : qi::grammar<It, llvm::APInt()> {
    IntLiteral() : IntLiteral::base_type(_start) {
        using namespace qi;
        using Raw = std::uint64_t;

        _start = no_case [ // case insensitive
            ("0x"      >> uint_parser<Raw, 16>{} |
            "0b"      >> uint_parser<Raw,  2>{} |
            &lit('0') >> uint_parser<Raw,  8>{} |
                        uint_parser<Raw, 10>{})
            // ignored for now:
            >> _optsuffix
        ] [ _val = coerce_type(_1, _2) ];

        _optsuffix = no_case[_suffix] | attr(Suffix::signed_);
    }

private:
    enum class Suffix {
        signed_   = 0,
        unsigned_ = 1,
        long_     = 2,
        longlong_ = 4,

        l_   = long_,
        ll_  = longlong_,
        ul_  = unsigned_ | l_,
        ull_ = unsigned_ | ll_,
    };

    struct suffix_sym : qi::symbols<char, Suffix> {
        suffix_sym() {
            this->add
                ("u",   Suffix::unsigned_)
                ("l",   Suffix::l_)
                ("ll",  Suffix::ll_)
                ("ul",  Suffix::ul_)  ("lu",  Suffix::ul_)
                ("ull", Suffix::ull_) ("llu", Suffix::ull_)
            ;
        }
    } _suffix;

    struct converter_f {
        template <typename T> static auto as(uint64_t raw) {
            return llvm::APInt(CHAR_BIT * sizeof(T), raw, std::is_signed_v<T>);
        }
        llvm::APInt operator()(uint64_t raw, Suffix sfx) const {
            switch (sfx) {
            case Suffix::signed_:   return as<signed>(raw);
            case Suffix::unsigned_: return as<unsigned>(raw);
            case Suffix::long_:     return as<long>(raw);
            case Suffix::longlong_: return as<long long>(raw);
            case Suffix::ul_:       return as<unsigned long>(raw);
            case Suffix::ull_:      return as<unsigned long long>(raw);
            }
            throw std::invalid_argument("sfx");
        }
    };
    boost::phoenix::function<converter_f> coerce_type;

    qi::rule<It, llvm::APInt()> _start;
    qi::rule<It, Suffix()>      _optsuffix;
};

void test() {
    std::cout << " ---- " << __PRETTY_FUNCTION__ << "\n";
    using It = std::string::const_iterator;
    IntLiteral<It> const parser {};

    for (std::string const input : {
            "1234",
            "1234u",
            "0x12f34ULL",
            "033ULL",
            "0b101011l",
            "33lu"
        }) {
        llvm::APInt value;
        if (parse(input.begin(), input.end(), parser >> qi::eoi, value)) {
            std::cout << "Parsed " << std::quoted(input) << " -> "
                    << value.toString(10, false) // TODO signed?
                    << " bits:" << value.getBitWidth() << "\n";
        } else {
            std::cout << "Failed to parse " << std::quoted(input) << "\n";
        }
    }
}

int main() {
    test();
}

Prints

 ---- void test()
Parsed "1234" -> 1234 bits:32
Parsed "1234u" -> 1234 bits:32
Parsed "0x12f34ULL" -> 77620 bits:64
Parsed "033ULL" -> 27 bits:64
Parsed "0b101011l" -> 43 bits:64
Parsed "33lu" -> 33 bits:64

Remaining Loose Ends

  • Of course, with semantic actions you can in fact parse the string representation using the fromString factory method

  • I don't know how to accurate ask APInt whether it is signed. I suspect I should have been parsing into a variant<APInt, APSInt> to retain that information

  • I didn't put work into detecting overflows. The first example should have that out-of-the-box (thanks to Qi)

  • I also didn't put work into supporting c++14 digit separators because it wasn't specified. And it didn't seem part of any "staple" feature anyways.

sehe
  • 374,641
  • 47
  • 450
  • 633
  • [Ironed out](https://stackoverflow.com/posts/67255937/revisions) an edge case with parsing "0" correctly due to the octal branch, and simplified the first samples. – sehe Apr 25 '21 at 17:11
  • The huge problem of this solution is that it ties the parser to host platform type widths. – Nikita Kniazev Apr 26 '21 at 13:26
  • @NikitaKniazev That's (a) not true (just a choice I made for exposition) (b) seemed the stated goal (_"I would like to write a Boost Spirit Qi parser that can parse arbitrary C integer literals (e.g. 1234 or 0x1234ULL"_). Just consider replacing `intmax_t` with Boost `cpp_int` and casting from there. It will even do bitfields if you want. – sehe Apr 26 '21 at 15:54
  • Dang, I wish I could upvote this twice, that's a much more extensive answer than I expected. – Peter Apr 26 '21 at 19:05