
What does fragment mean in ANTLR?

I've seen both rules:

fragment DIGIT : '0'..'9';

and

DIGIT : '0'..'9';

What is the difference?

Ronald Wildenberg
Oscar Mederos

4 Answers


A fragment is somewhat akin to an inline function: it makes the grammar more readable and easier to maintain.

A fragment will never be counted as a token; it only serves to simplify a grammar.

Consider:

NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
fragment DIGITS: '1'..'9' '0'..'9'*;
fragment OCTAL_DIGITS: '0' '0'..'7'+;
fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;

In this example, the lexer will always hand the parser a NUMBER token, regardless of whether it matched "1234", "0xab12", or "0777".

See item 3
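
To see what this means from the parser's side, here is a minimal sketch (the parser rule value is my own, not part of this answer): NUMBER is the only token type the lexer produces for these rules, so it is the only one a parser rule can reference.

value : NUMBER ;
// value : DIGITS ;  // DIGITS is a fragment, not a token: ANTLR would only warn about an
//                   // implicitly defined token here, and the lexer would never produce it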

sirbrialliance
  • You're right about what `fragment` means in ANTLR. But the example you give is a poor one: you don't want a lexer to produce a `NUMBER` token that can be a hex, decimal or octal number. That would mean you'd need to inspect the `NUMBER` token in a production (parser rule). You could better let the lexer produce `INT`, `OCT`, and `HEX` tokens and create a production rule: `number : INT | OCT | HEX;`. In such an example, a `DIGIT` could be a fragment which would be used by the tokens `INT` and `HEX`. – Bart Kiers Jun 28 '11 at 09:36
  • Note that "poor" might sound a bit harsh, but I couldn't find a better word for it... Sorry! :) – Bart Kiers Jun 28 '11 at 09:41
  • You were not sounding harsh .. you were right and straight! – asyncwait Feb 08 '14 at 09:43
  • Importantly, fragments are intended to be used only in other lexer rules to define other lexer tokens. Fragments are not meant to be used in grammar (parser) rules. – djb Aug 21 '14 at 14:58
  • @BartKiers: could you create a new answer including your better answer? – David Newcomb Dec 06 '15 at 19:57
  • The problem with this answer is that it doesn't explain the difference at all. It says that the lexer will always hand the parser a NUMBER token, regardless of whether it matched "1234", "0xab12", or "0777". But that statement remains just as true even if you remove all instances of the keyword "fragment" from the above grammar! – Vesal Dec 06 '18 at 21:08
  • The definition of DIGITS does not allow the number zero by itself. – dmolony Mar 24 '19 at 10:00
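
As a minimal sketch of the design Bart Kiers suggests in the first comment above (the rule names follow his comment; the character ranges are carried over from the answer, so treat this as an illustration, not his exact code):

grammar Numbers;                              // sketch: the grammar name is mine

number : INT | OCT | HEX ;                    // parser rule: the parser can tell the kinds apart

INT : '1'..'9' DIGIT* ;                       // decimal
OCT : '0' '0'..'7'+ ;                         // octal
HEX : '0x' (DIGIT | 'a'..'f' | 'A'..'F')+ ;   // hexadecimal
fragment DIGIT : '0'..'9' ;                   // shared by INT and HEX, never a token itself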

According to The Definitive ANTLR 4 Reference:

Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.

In practice, they improve the readability of your grammars.

Look at this example:

STRING : '"' (ESC | ~["\\])* '"' ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;

STRING is a lexer rule that uses the fragment rule ESC; UNICODE is used in the ESC rule, and HEX is used in the UNICODE fragment rule. The ESC, UNICODE, and HEX rules cannot be used explicitly in parser rules.
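
As an illustration (the sample input below is mine, not from the book), this is how the fragments nest when STRING matches a string literal; the parser only ever receives a single STRING token.

// Input (including the surrounding quotes):   "caf\u00e9\n"
//   c, a, f   each match ~["\\]
//   \u00e9    matches ESC -> UNICODE -> 'u' HEX HEX HEX HEX
//   \n        matches ESC -> ["\\/bfnrt]
value : STRING ;   // a parser rule (mine) may use STRING, but not ESC, UNICODE or HEX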

Nastaran Hakimi

The Definitive ANTLR 4 Reference (Page 106):

Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.


Abstract Concepts:

Case1: (if I need the RULE1, RULE2, RULE3 entities or their group info)

rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;


Case2: (if I don't care about RULE1, RULE2, RULE3 and only focus on RULE0)

RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
// RULE0 is a terminal node. 
// You can't name it 'rule0', or you will get syntax errors:
// 'A-C' came as a complete surprise to me while matching alternative
// 'DEF' came as a complete surprise to me while matching alternative


Case3: (equivalent to Case2, but more readable than Case2)

RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
// You can't name it 'rule0', or you will get warnings:
// warning(125): implicit definition of token RULE1 in parser
// warning(125): implicit definition of token RULE2 in parser
// warning(125): implicit definition of token RULE3 in parser
// and it would fail to capture the rule0 content


What are the differences between Case1 and Case2/3?

  1. The lexer rules are equivalent
  2. Each of RULE1/2/3 in Case1 is a capturing group, similar to Regex:(X)
  3. Each of RULE1/2/3 in Case3 is a non-capturing group, similar to Regex:(?:X)



Let's see a concrete example.

Goal: identify [ABC]+, [DEF]+, [GHI]+ tokens

input.txt

ABBCCCDDDDEEEEE ABCDE
FFGGHHIIJJKK FGHIJK
ABCDEFGHIJKL


Main.py

import sys
from antlr4 import *
from AlphabetLexer import AlphabetLexer
from AlphabetParser import AlphabetParser
from AlphabetListener import AlphabetListener

class MyListener(AlphabetListener):
    # Exit a parse tree produced by AlphabetParser#content.
    def exitContent(self, ctx:AlphabetParser.ContentContext):
        pass

    # (For Case1 Only) enable it when testing Case1
    # Exit a parse tree produced by AlphabetParser#rule0.
    def exitRule0(self, ctx:AlphabetParser.Rule0Context):
        print(ctx.getText())
# end-of-class

def main():
    file_name = sys.argv[1]
    input = FileStream(file_name)
    lexer = AlphabetLexer(input)
    stream = CommonTokenStream(lexer)
    parser = AlphabetParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))

    listener = MyListener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def

main()


Case1 and results:

Alphabet.g4 (Case1)

grammar Alphabet;

content : (rule0|ANYCHAR)* EOF;

rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;

ANYCHAR : . -> skip;

Result:

# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL

$ python3 Main.py input.txt 
(content (rule0 ABBCCC) (rule0 DDDDEEEEE) (rule0 ABC) (rule0 DE) (rule0 FF) (rule0 GGHHII) (rule0 F) (rule0 GHI) (rule0 ABC) (rule0 DEF) (rule0 GHI) <EOF>)
ABBCCC
DDDDEEEEE
ABC
DE
FF
GGHHII
F
GHI
ABC
DEF
GHI


Case2/3 and results:

Alphabet.g4 (Case2)

grammar Alphabet;

content : (RULE0|ANYCHAR)* EOF;

RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;

ANYCHAR : . -> skip;

Alphabet.g4 (Case3)

grammar Alphabet;

content : (RULE0|ANYCHAR)* EOF;

RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;

ANYCHAR : . -> skip;

Result:

# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL

$ python3 Main.py input.txt 
(content ABBCCC DDDDEEEEE ABC DE FF GGHHII F GHI ABC DEF GHI <EOF>)

Do you see the "capturing groups" and "non-capturing groups" parts?




Let's look at a second concrete example.

Goal: identify octal / decimal / hexadecimal numbers

input.txt

0
123
 1~9999
 001~077
0xFF, 0x01, 0xabc123


Number.g4

grammar Number;

content
    : (number|ANY_CHAR)* EOF
    ;

number
    : DECIMAL_NUMBER
    | OCTAL_NUMBER
    | HEXADECIMAL_NUMBER
    ;

DECIMAL_NUMBER
    : [1-9][0-9]*
    | '0'
    ;

OCTAL_NUMBER
    : '0' '0'..'9'+
    ;

HEXADECIMAL_NUMBER
    : '0x'[0-9A-Fa-f]+
    ;

ANY_CHAR
    : .
    ;


Main.py

import sys
from antlr4 import *
from NumberLexer import NumberLexer
from NumberParser import NumberParser
from NumberListener import NumberListener

class Listener(NumberListener):
    # Exit a parse tree produced by NumberParser#Number.
    def exitNumber(self, ctx:NumberParser.NumberContext):
        print('%8s, dec: %-8s, oct: %-8s, hex: %-8s' % (ctx.getText(),
            ctx.DECIMAL_NUMBER(), ctx.OCTAL_NUMBER(), ctx.HEXADECIMAL_NUMBER()))
    # end-of-def
# end-of-class

def main():
    input = FileStream(sys.argv[1])
    lexer = NumberLexer(input)
    stream = CommonTokenStream(lexer)
    parser = NumberParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))

    listener = Listener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def

main()


Result:

# Input data (for reference)
# 0
# 123
#  1~9999
#  001~077
# 0xFF, 0x01, 0xabc123

$ python3 Main.py input.txt 
(content (number 0) \n (number 123) \n   (number 1) ~ (number 9999) \n   (number 001) ~ (number 077) \n (number 0xFF) ,   (number 0x01) ,   (number 0xabc123) \n <EOF>)
       0, dec: 0       , oct: None    , hex: None    
     123, dec: 123     , oct: None    , hex: None    
       1, dec: 1       , oct: None    , hex: None    
    9999, dec: 9999    , oct: None    , hex: None    
     001, dec: None    , oct: 001     , hex: None    
     077, dec: None    , oct: 077     , hex: None    
    0xFF, dec: None    , oct: None    , hex: 0xFF    
    0x01, dec: None    , oct: None    , hex: 0x01    
0xabc123, dec: None    , oct: None    , hex: 0xabc123

If you add the modifier 'fragment' to DECIMAL_NUMBER, OCTAL_NUMBER, and HEXADECIMAL_NUMBER, you won't be able to capture the number entities anymore (since they are no longer tokens). The result will then be:

$ python3 Main.py input.txt 
(content 0 \n 1 2 3 \n   1 ~ 9 9 9 9 \n   0 0 1 ~ 0 7 7 \n 0 x F F ,   0 x 0 1 ,   0 x a b c 1 2 3 \n <EOF>)
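
For reference, the grammar that produces this output is simply the Number.g4 above with the keyword added to the three number rules (a sketch of the changed lines only):

fragment DECIMAL_NUMBER     : [1-9][0-9]* | '0' ;
fragment OCTAL_NUMBER       : '0' '0'..'9'+ ;
fragment HEXADECIMAL_NUMBER : '0x'[0-9A-Fa-f]+ ;
// The parser rule `number` now refers to names that are no longer tokens, so it never
// matches, and every character reaches the parser as a separate ANY_CHAR token.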
蔡宗容

This blog post has a very clear example where fragment makes a significant difference:

grammar number;  

number: INT;  
DIGIT : '0'..'9';  
INT   :  DIGIT+;

The grammar will recognize '42' but not '7': for a single digit, both DIGIT and INT match the same one-character input, and ANTLR breaks the tie in favor of the rule defined first, so '7' becomes a DIGIT token rather than the INT that the parser rule expects. You can fix it by making DIGIT a fragment (or by moving the DIGIT rule after INT).
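
For comparison, here is a sketch of the fragment version of that grammar (my reconstruction of the fix, not the blog post's exact code):

grammar number;

number: INT;
fragment DIGIT : '0'..'9';   // no longer defines a token, so it cannot shadow INT
INT   :  DIGIT+;             // '7' and '42' now both lex as INT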

Vesal
  • The problem here is not the absence of the keyword `fragment`, but the order of the lexer rules. – Dec 05 '18 at 16:09
  • I used the word "fix", but the point is not to fix a problem. I added this example here because, for me, this was the most helpful and simple example of what actually *changes* when using the keyword fragment. – Vesal Dec 06 '18 at 17:22
  • I'm just arguing that declaring `DIGIT` as a fragment of `INT` solves the problem just because fragments do not define tokens, thus making `INT` the first lexer rule. I agree with you that this is a meaningful example, but (imo) only for those who already know what the `fragment` keyword means. I find it somewhat misleading for someone who is trying to figure out the correct use of fragments for the first time. – Dec 06 '18 at 17:52
  • So when I was learning this, I saw plenty of examples such as the ones above, but I didn't quite see why one would need an extra keyword for this. I didn't understand what "not being tokens in their own right" meant in practice. Now, I'm not actually sure what a good answer to the original question would be. I will add a comment above explaining why I'm not satisfied with the accepted answer. – Vesal Dec 06 '18 at 20:56