antlr 4 iso-8859-15 encoded file matching string containing \u0161 š

Question

I have this grammar:

KEY
: [a-zA-Z\u0160\u0161\u00C0-\u00FF][a-zA-Z_0-9\-\''\u0160\u0161\u00C0-\u00FF]* 
;

Reading an ISO-8859-15 encoded text file

new ANTLRFileStream(fileName, "ISO-8859-15")

with the string Milešovka. Why is š giving a token recognition error?

Trace:

 line 110:6 token recognition error at: ''exit    field, LT(1)={

EDIT: I am using antlr 4.5.1 (and have tested 4.4 - same issue).

Does ANTLRFileStream always provide a stream of *Unicode* characters to the lexer? [Then \u0161 would be right] Or is that encoding just a way to tell it to read 8 bit bytes, without interpreting them? [Then \u00a8 would be the correct code for "š".] — Ira Baxter, Jan 28 '16 at 10:09
Correcting myself: using \u00a8 does work. Ira Baxter, your guess seems to be correct: the encoding is just a way to tell it to read 8-bit bytes. — simsulla, Jan 28 '16 at 11:31
The ANTLRFileStream scheme seems singularly silly. If ANTLR is going to handle "16 bit" codes, why would it not always run using the Unicode character set? What this means is that your lexer depends on the encoding of your file, which will change based on locale and even the direction of the wind. [I guessed what your problem was based on similar silliness we had 15 years ago with our parsing tools, that made us go solve the encoding problem right]. — Ira Baxter, Jan 28 '16 at 14:08

Dzmitry Paulenka · Answer 1 · 2016-01-28T11:04:08.343

I think the problem might be in the way you generate the parser. I'm not sure what exactly could go wrong, but I managed to build a working example with your symbol that uses Maven to generate the parser from the grammar.

pom.xml

<build>
    <plugins>
        <plugin>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-maven-plugin</artifactId>
            <version>4.5</version>
            <configuration>
                <outputDirectory>src/main/java</outputDirectory>
                <listener>false</listener>
                <visitor>true</visitor>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>antlr4</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
    </plugins>
</build>

<dependencies>
    <dependency>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-runtime</artifactId>
        <version>4.5.1</version>
    </dependency>
</dependencies>

LexerGrammar.g

lexer grammar TestLexer;

LBR: '[';
RBR: ']';
KEY
: [a-zA-Z\u0160\u0161\u00C0-\u00FF][a-zA-Z_0-9\-\''\u0160\u0161\u00C0-\u00FF]*
;

ParserGrammar.g

parser grammar TestParser;

options { tokenVocab=TestLexer; }

rul   : block+ ;
block  : LBR KEY RBR ;

Full example code is here

Also make sure that your file is actually in `ISO-8859-15`; some editors might automatically save in "UTF-8". To test this, try reading the file as `UTF-8`. — Dzmitry Paulenka, Jan 28 '16 at 10:49
It is ANSI encoded (ISO-8859-15) in Notepad++: Milešovka. If I set it to UTF-8: Mileۯvka. EDIT: š is x9A — simsulla, Jan 28 '16 at 10:57
So with the Maven build, what does your grammar (.g4 file) look like? — simsulla, Jan 28 '16 at 10:59
Correcting my previous comment: if set to ISO-8859-15 in Notepad++: Milešovka. If I set it to UTF-8: Mileۯvka (i.e. x00a8). I had tested with ANSI encoding as well. — simsulla, Jan 28 '16 at 11:17
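
The comments above go back and forth on which byte actually stores š in the file (x9A or xA8). One way to settle that without relying on an editor is to dump the non-ASCII bytes directly. The following is a minimal sketch using only the JDK; the class name `ByteDump` and the helper `hexOfNonAscii` are hypothetical and not part of the original example project. For reference, š is stored as A8 in ISO-8859-15, as 9A in windows-1252 (what Notepad++ usually calls "ANSI" on Western systems), and as C5 A1 in UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteDump {
    /** Returns the non-ASCII bytes of data as space-separated uppercase hex. */
    static String hexOfNonAscii(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xFF;               // treat the byte as unsigned
            if (v > 0x7F) {                 // keep only bytes outside ASCII
                if (sb.length() > 0) sb.append(' ');
                sb.append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "Mile\u0161ovka";
        // Encode the same word three ways and show how š is stored each time.
        System.out.println(hexOfNonAscii(word.getBytes(Charset.forName("ISO-8859-15"))));  // A8
        System.out.println(hexOfNonAscii(word.getBytes(Charset.forName("windows-1252")))); // 9A
        System.out.println(hexOfNonAscii(word.getBytes(StandardCharsets.UTF_8)));          // C5 A1
    }
}
```

To inspect a real input file, pass `java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(fileName))` to `hexOfNonAscii` and compare the output with the three values above.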

score 0 · Answer 2 · answered Jan 28 '16 at 11:36

Ira Baxter's comment answers the question:

Does ANTLRFileStream always provide a stream of Unicode characters to the lexer? [Then \u0161 would be right] Or is that encoding just a way to tell it to read 8 bit bytes, without interpreting them? [Then \u00a8 would be the correct code for "š".]
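
The two alternatives in the quoted comment can be illustrated with the JDK alone. This is a sketch under stated assumptions: the class name `DecodeCheck` and the helper `codePointOf` are hypothetical, and the code does not show what ANTLRFileStream itself hands to the lexer. It only shows which code point results when the byte xA8 is decoded as ISO-8859-15, and which results when it is passed through unchanged (ISO-8859-1 maps every byte to the code point of the same value).

```java
import java.nio.charset.Charset;

class DecodeCheck {
    /** Decodes a single byte with the named charset and formats the result as U+XXXX. */
    static String codePointOf(int byteValue, String charsetName) {
        String s = new String(new byte[] { (byte) byteValue }, Charset.forName(charsetName));
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Byte interpreted as ISO-8859-15: the lexer rule would need \u0161.
        System.out.println(codePointOf(0xA8, "ISO-8859-15")); // U+0161
        // Byte passed through uninterpreted: the lexer rule would need \u00a8.
        System.out.println(codePointOf(0xA8, "ISO-8859-1"));  // U+00A8
    }
}
```

Printing the code points of the characters that actually reach the lexer, in the same U+XXXX form, would show which of the two cases applies in a given setup.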
