How to match a string, but case-insensitively?

Question

Let's say that I want to match "beer", but don't care about case sensitivity.

Currently I am defining a token as (('b'|'B') ('e'|'E') ('e'|'E') ('r'|'R')), but I have a lot of these and don't really want to spell out 'verilythisisaverylongtokenindeedomyyesitis' that way.

The ANTLR wiki seems to suggest that it can't be done (in ANTLR), but I just wondered if anyone had some clever tricks.

According to the page you cite, I wouldn't say it's impossible in ANTLR. There is no off-the-shelf lexer option to handle tokens in a case-insensitive way. But it can be done by implementing a custom string/file stream that normalizes characters to a definite (e.g., UPPER) case. Then you will be able to define tokens in the standard way, e.g., `@tokens { BEER = 'BEER'; }`. – dzieciou, Feb 02 '12 at 23:42
Thanks for pointing that out (+1). I have updated the link to point at a copy on http://archive.org (AKA the Wayback Machine) – Mawg says reinstate Monica, Oct 05 '16 at 08:12

WestCoastProjects · Answer 1 · 2021-09-22T21:11:22.073

I would like to add to the accepted answer: a ready-made set can be found at case insensitive antlr building blocks, and the relevant portion is included below for convenience:

fragment A:[aA];
fragment B:[bB];
fragment C:[cC];
fragment D:[dD];
fragment E:[eE];
fragment F:[fF];
fragment G:[gG];
fragment H:[hH];
fragment I:[iI];
fragment J:[jJ];
fragment K:[kK];
fragment L:[lL];
fragment M:[mM];
fragment N:[nN];
fragment O:[oO];
fragment P:[pP];
fragment Q:[qQ];
fragment R:[rR];
fragment S:[sS];
fragment T:[tT];
fragment U:[uU];
fragment V:[vV];
fragment W:[wW];
fragment X:[xX];
fragment Y:[yY];
fragment Z:[zZ];

So an example is

   HELLOWORLD : H E L L O W O R L D;
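If there are many keywords, the per-letter spelling can be generated rather than typed by hand. The following is only a sketch in Python, not part of ANTLR: the helper names `fragment_rules` and `keyword_rule` are made up for illustration, and it assumes keywords consist of ASCII letters only.

```python
import string


def fragment_rules():
    """Return the 26 per-letter fragment rules, e.g. 'fragment A:[aA];'."""
    return [f"fragment {c}:[{c.lower()}{c}];" for c in string.ascii_uppercase]


def keyword_rule(word):
    """Spell an ASCII-letter keyword as a sequence of fragment references."""
    if not (word.isascii() and word.isalpha()):
        # The A-Z fragments cannot express digits, underscores, or non-ASCII letters.
        raise ValueError("only ASCII letters are covered by the A-Z fragments")
    name = word.upper()
    return f"{name} : {' '.join(name)};"


# Emit one lexer rule for the example keyword.
print(keyword_rule("helloworld"))
```

The output can be pasted into the grammar next to the fragment block; `fragment_rules()` regenerates that block if it is ever needed.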

This should be the solution: it is clean and results in the least boilerplate. – Ralph Caraveo, Mar 01 '18 at 00:45

score 17 · Accepted Answer · answered Dec 04 '09 at 03:10


How about defining a lexer token for each permissible identifier character, then constructing the parser token as a series of those?

beer: B E E R;

A : 'A'|'a';
B: 'B'|'b';

etc.

Jonathan Feinberg


If you take this approach, I think the "beer" rule should probably be a lexer rule name in all caps (`BEER: B E E R;`), and each of the per-letter rules should be prefixed by the `fragment` keyword. This way you get "BEER" as a single token, rather than four tokens that individually mean nothing. – Darien Aug 20 '12 at 23:06

score 9 · Answer 3 · edited Mar 11 '23 at 18:13


A case-insensitive option has since been added to ANTLR:

options { caseInsensitive = true; }

https://github.com/antlr/antlr4/blob/master/doc/options.md#caseinsensitive

The old links are now broken; the link above should continue to work.

Ivan Kochurkin


answered Dec 26 '21 at 01:29

R. C. Howell


The old links are broken because they have been out of date since ANTLR 4.10 – Ivan Kochurkin Mar 11 '23 at 18:14

score 4 · Answer 4 · answered Mar 19 '14 at 00:16


Define case-insensitive tokens with

BEER: [Bb] [Ee] [Ee] [Rr];
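For a long list of keywords, rules of this shape can also be produced mechanically. Here is a rough Python sketch, under a few assumptions of my own: the function name `char_class_rule` is hypothetical, keywords are ASCII, and digits or underscores are emitted as plain quoted literals because they have no case.

```python
def char_class_rule(word):
    """Build e.g. "BEER: [Bb] [Ee] [Ee] [Rr];" from "beer"."""
    if not word or not (word[0].isascii() and word[0].isalpha()):
        # The rule name is derived from the keyword, so it must start with a letter.
        raise ValueError("keyword must start with an ASCII letter")
    parts = []
    for ch in word:
        if ch.isascii() and ch.isalpha():
            # One character set per letter, upper case first as in the answer.
            parts.append(f"[{ch.upper()}{ch.lower()}]")
        elif ch.isascii() and (ch.isdigit() or ch == "_"):
            # Caseless characters are matched literally.
            parts.append(f"'{ch}'")
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return f"{word.upper()}: {' '.join(parts)};"


print(char_class_rule("beer"))
```

Compared with the fragment-based spelling, this needs no supporting rules in the grammar, at the cost of longer lines.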

idrosid


score 1 · Answer 5 · answered Jan 25 '18 at 23:40

A new documentation page has appeared in the ANTLR GitHub repo: Case-Insensitive Lexing. You can use two approaches:

1. The one described in @javadba's answer.
2. Or add a character stream to your code that transforms the input stream to lower or upper case. Examples for the main target languages can be found on the same doc page.

In my opinion, it's better to use the first approach and have a grammar that describes all the rules. But if you use a well-known grammar, for example from Grammars written for ANTLR v4, then the second approach may be more appropriate.

You might then run into some inconsistencies with, e.g., Turkish characters: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#toLowerCase() – Michal Bida, Dec 14 '18 at 12:30
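To make the concern concrete, here is a small Python illustration. Note the assumption: the linked Java method is locale-sensitive, while Python's `str.lower()` and `str.upper()` are not, so this only shows the locale-independent part of the problem, namely that case mapping is neither length-preserving nor reversible for the Turkish letters.

```python
dotted_capital_i = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small_i = "\u0131"    # LATIN SMALL LETTER DOTLESS I

# Lower-casing the dotted capital yields TWO code points ('i' + combining dot),
# so a stream that lower-cases character by character changes the input length.
lowered = dotted_capital_i.lower()
print(len(lowered), [hex(ord(c)) for c in lowered])

# Upper-casing the dotless small i gives plain ASCII 'I' ...
print(dotless_small_i.upper() == "I")

# ... and lower-casing that again gives 'i', not the original letter.
print(dotless_small_i.upper().lower() == dotless_small_i)
```

A normalizing stream that only touches A-Z (as the C# answer further down does) sidesteps this, but then non-ASCII letters are not matched case-insensitively at all.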

score 0 · Answer 6 · answered Oct 06 '16 at 20:26

A solution I used in C#: use the ASCII code to shift each character to lower case.

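// Input stream whose lookahead reports ASCII upper-case letters as lower case.
// Only La() is overridden, so the stored input text itself is not modified.
class CaseInsensitiveStream : Antlr4.Runtime.AntlrInputStream {
  public CaseInsensitiveStream(string sExpr)
     : base(sExpr) {
  }
  public override int La(int index) {
     if(index == 0) return 0;   // lookahead of 0 is undefined
     if(index < 0) index++;     // La(-1) is the previously consumed character
     int pdx = p + index - 1;
     if(pdx < 0 || pdx >= n) return TokenConstants.Eof;
     var x1 = data[pdx];
     // 65..90 is 'A'..'Z' in ASCII and 97 is 'a'
     return (x1 >= 65 && x1 <= 90) ? (97 + x1 - 65) : x1;
  }
}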
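For readers who want to try the lookahead logic without an ANTLR runtime, here is the same idea as a standalone Python sketch. Everything in it is an assumption for illustration: the class `LowerCasingStream`, its `la` and `consume` methods, and the `EOF = -1` marker are not ANTLR API, they only mirror the shape of the C# code above.

```python
EOF = -1  # assumed end-of-input marker for this sketch


class LowerCasingStream:
    """Toy character stream whose lookahead lower-cases ASCII A-Z."""

    def __init__(self, text):
        self.data = text   # original text, never modified
        self.p = 0         # index of the next unread character

    def consume(self):
        """Advance past one character, if any remain."""
        if self.p < len(self.data):
            self.p += 1

    def la(self, offset):
        """Return the code point `offset` characters ahead (1-based)."""
        if offset == 0:
            return 0               # undefined lookahead, as in the C# version
        if offset < 0:
            offset += 1            # la(-1) is the character just consumed
        idx = self.p + offset - 1
        if idx < 0 or idx >= len(self.data):
            return EOF
        ch = self.data[idx]
        # Only ASCII upper-case letters are shifted; everything else passes through.
        return ord(ch.lower()) if "A" <= ch <= "Z" else ord(ch)


stream = LowerCasingStream("BeEr")
print([chr(stream.la(i)) for i in range(1, 5)])
```

The point of normalizing only in the lookahead is that a lexer sees lower-case input while the original spelling stays available in `data` for token text.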

Why did you hard code `65`, `90` and `97`? Your code would be much more readable, and maintainable, if you used `A`, `Z` and `a` – Mawg says reinstate Monica, Jan 26 '18 at 07:40
