Below is my grammar file.

grammar My;

tokens {
    DELIMITER
}

string:SINGLE_QUOTED_TEXT;

SINGLE_QUOTED_TEXT: (
        '\'' (.)*? '\''
    )+
;

I'm trying to use this to accept any string (it's actually part of MySQL's g4 grammar). I then use this code to test it:

#include "MyLexer.h"
#include "MyParser.h"
#include <string>
using namespace My;

int main()
{
    std::string s = "'中'";

    antlr4::ANTLRInputStream input(s);
    MyLexer lexer(&input);

    antlr4::CommonTokenStream tokens(&lexer);
    MyParser parser(&tokens);

    parser.string();

    return 0;
}

The result is shown in a screenshot (image not included here).

The UTF-8 encoding of the Chinese character 中 is 3 bytes: \xe4 \xb8 \xad
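To check the byte sequence quoted above, a small helper can dump the bytes of a std::string as hex. This is only a sketch; the name hexBytes is hypothetical and not part of ANTLR or of the original code.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats every byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hexBytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {  // unsigned char avoids sign extension
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hexBytes("\xe4\xb8\xad") (the three bytes of 中) returns "e4 b8 ad".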

Both the grammar file and the source file are encoded in UTF-8. What can I do to make this work?

  • Why does the output in the screenshot contain double quotes when the string in the code only contains single quotes? Are you sure you're running the same code you posted here? – sepp2k Nov 23 '21 at 16:11
  • Sorry, the screenshot has been updated with the right one. @sepp2k – user9634413 Nov 23 '21 at 16:18
  • The grammar/input file works fine for the C# and Java targets, v4.9.2. This could be a C++ runtime issue, but my tool to generate and build a C++ targeted parser isn't working, so I can't check. Aside, I don't know why you make a "tokens" declaration, and people normally call a "SINGLE_QUOTED_TEXT" a single-quoted string, not multiple via the +-operator closure. – kaby76 Nov 23 '21 at 22:04

1 Answer

I've figured out the problem.

For reference, see https://stackoverflow.com/a/26865200/9634413

The ANTLR C++ runtime uses a std::u32string to store its input. When char is signed, the byte \xe4 is sign-extended during the conversion and becomes \xffffffe4, which is outside the Unicode range [0, 0x10ffff].
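The sign extension can be reproduced without ANTLR. The sketch below uses signed char explicitly so that the behavior does not depend on whether plain char is signed on a given platform; the function names widenSigned and widenUnsigned are hypothetical.

```cpp
// Widen a byte held in a signed char directly to char32_t.
// A negative value wraps modulo 2^32, so the high bits are all set.
char32_t widenSigned(signed char c) {
    return static_cast<char32_t>(c);
}

// Widen through unsigned char first, so the value stays in [0, 255].
char32_t widenUnsigned(signed char c) {
    return static_cast<char32_t>(static_cast<unsigned char>(c));
}
```

The byte \xe4 held in a signed char has the value -28. widenSigned turns it into 0xffffffe4, while widenUnsigned yields 0xe4.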

To fix this, derive a class from ANTLRInputStream and have its constructor convert each byte through unsigned char:

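#include <algorithm>  // std::transform
#include "antlr4-runtime.h"

// Input stream that widens each input byte via unsigned char,
// so bytes >= 0x80 are not sign-extended.
class MyStream : public antlr4::ANTLRInputStream {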
public:
    MyStream(const std::string& input = "")
        : antlr4::ANTLRInputStream(input)
    {
        // Remove the UTF-8 BOM if present
        const char bom[4] = "\xef\xbb\xbf";
        if (input.compare(0, 3, bom, 3) == 0) {
            std::transform(input.begin() + 3, input.end(), _data.begin(),
                [](char c) -> unsigned char { return c; });
        }
        else {
            std::transform(input.begin(), input.end(), _data.begin(),
                [](char c) -> unsigned char { return c; });
        }
        p = 0;
    }
    MyStream(const char data_[], size_t numberOfActualCharsInArray)
        : antlr4::ANTLRInputStream(data_, numberOfActualCharsInArray)
    {
    }
    MyStream(std::istream& stream)
        : antlr4::ANTLRInputStream(stream)
    {
    }
};
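The byte-widening step in the constructor above can also be sketched on its own, without the ANTLR runtime, which makes it easier to test. The helper name widenBytes is hypothetical. It assumes the same approach as the class above: each byte becomes one char32_t value, with no UTF-8 decoding.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical standalone version of the conversion done in MyStream:
// skip a leading UTF-8 BOM if present, then widen every remaining byte
// through unsigned char so the result stays in [0, 255].
std::u32string widenBytes(const std::string& input) {
    static const char bom[] = "\xef\xbb\xbf";
    const std::size_t start = (input.compare(0, 3, bom, 3) == 0) ? 3 : 0;

    std::u32string out(input.size() - start, U'\0');
    std::transform(input.begin() + start, input.end(), out.begin(),
        [](char c) -> char32_t { return static_cast<unsigned char>(c); });
    return out;
}
```

With this approach, 中 reaches the lexer as three separate values (0xe4, 0xb8, 0xad) rather than as the single code point U+4E2D. That is enough for a rule such as SINGLE_QUOTED_TEXT, which accepts any character between the quotes, but token text would have to be converted back to bytes in the same way.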