Using Antlr to parse formulas with multiple locales

Question

I'm very new to Antlr, so forgive what may be a very easy question.

I am creating a grammar that parses Excel-like formulas. It needs to support multiple locales, which differ in the list separator ("," for en-US) and the decimal separator ("." for en-US). I would prefer not to maintain a separate grammar per locale and pick one at parse time.

Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.

I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.

score 0 · Accepted Answer · edited May 23 '17 at 11:45

What you can do is add a locale (or custom separator and grouping characters) to your lexer, and put a semantic predicate in front of each lexer rule that checks the next input character against those custom characters. That way the separator tokens are matched dynamically, based on the locale the lexer was created with.

I don't have ANTLR and C# running here, but the Java demo should be pretty similar:

grammar LocaleDemo;

@lexer::header {
  import java.text.DecimalFormatSymbols;
  import java.util.Locale;
}

@lexer::members {

  private char decimalSeparator = '.';
  private char groupingSeparator = ',';

  public LocaleDemoLexer(CharStream input, Locale locale) {
    this(input);
    DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
    this.decimalSeparator = dfs.getDecimalSeparator();
    this.groupingSeparator = dfs.getGroupingSeparator();
  }
}

parse
 : .*? EOF
 ;

NUMBER
 : D D? ( DG D D D )* ( DS D+ )?
 ;

OTHER
 : .
 ;

fragment D  : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}?  . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;

To test the grammar above, run this class:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;

public class Main {

    private static void tokenize(String input, Locale locale) {

        LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
        System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);

        for (Token t : lexer.getAllTokens()) {
            System.out.printf("  %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }

    public static void main(String[] args) throws Exception {

        tokenize("1.23", Locale.ENGLISH);
        tokenize("1.23", Locale.GERMAN);

        tokenize("12.345.678,90", Locale.ENGLISH);
        tokenize("12.345.678,90", Locale.GERMAN);
    }
}

which would print:

input='1.23', locale=en, tokens:
  NUMBER     '1.23'

input='1.23', locale=de, tokens:
  NUMBER     '1'
  OTHER      '.'
  NUMBER     '23'

input='12.345.678,90', locale=en, tokens:
  NUMBER     '12.345'
  OTHER      '.'
  NUMBER     '67'
  NUMBER     '8'
  OTHER      ','
  NUMBER     '90'

input='12.345.678,90', locale=de, tokens:
  NUMBER     '12.345.678,90'
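
For anyone who wants to see the effect of the locale-derived separators without generating a lexer first, here is a minimal sketch that approximates the NUMBER rule with a regular expression built from `DecimalFormatSymbols`. This is plain `java.util.regex`, not ANTLR, and the class and method names (`LocaleNumberSketch`, `numberPattern`, `numbers`) are made up for illustration. It only collects the NUMBER matches and skips the OTHER tokens.

```java
import java.text.DecimalFormatSymbols;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocaleNumberSketch {

    // Regex counterpart of the lexer rule: D D? ( DG D D D )* ( DS D+ )?
    public static Pattern numberPattern(Locale locale) {
        DecimalFormatSymbols dfs = DecimalFormatSymbols.getInstance(locale);
        // Quote the separators so that '.' is treated literally, not as "any character".
        String ds = Pattern.quote(String.valueOf(dfs.getDecimalSeparator()));
        String dg = Pattern.quote(String.valueOf(dfs.getGroupingSeparator()));
        return Pattern.compile("[0-9][0-9]?(?:" + dg + "[0-9]{3})*(?:" + ds + "[0-9]+)?");
    }

    // Returns every NUMBER-like match in the input, in order of appearance.
    public static List<String> numbers(String input, Locale locale) {
        List<String> result = new ArrayList<>();
        Matcher m = numberPattern(locale).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(numbers("12.345.678,90", Locale.ENGLISH)); // [12.345, 67, 8, 90]
        System.out.println(numbers("12.345.678,90", Locale.GERMAN));  // [12.345.678,90]
    }
}
```

The matches line up with the NUMBER tokens in the output above: the same input splits differently depending only on which characters the locale supplies as separators.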


So far I have been able to adapt your solution, and I believe I can make it work if I can solve this question. The _input.LA(1) call (which in the C# runtime is _input.La(1)) returns an int with the value of the symbol. However, the documentation on the function doesn't state what encoding that is, whether it be ASCII, Unicode, etc. Do you know what the encoding is/should be? - Walter Williams, Mar 16 '16 at 20:41
In Java, the `.` meta character in a lexer rule matches any character in the range `0`..`\uFFFF` - Bart Kiers, Mar 16 '16 at 20:45
That would seem to suggest UTF-16, and I have coded it as such. I'm still writing test cases but so far it appears to be working. - Walter Williams, Mar 16 '16 at 21:50
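
To make the comparison in the predicate concrete: in Java a `char` is a UTF-16 code unit and widens to `int`, so `_input.LA(1) == decimalSeparator` is a plain numeric comparison. The sketch below only demonstrates that, using a hypothetical helper named `matches`; it does not touch the ANTLR runtime. As far as I know, newer ANTLR runtimes can also hand out full Unicode code points from LA(1), but for separators like `,`, `.` and `;` the values are the same either way.

```java
public class SeparatorCodes {

    // Mimics the predicate check: an int lookahead value compared with a char separator.
    public static boolean matches(int lookahead, char separator) {
        return lookahead == separator; // the char is widened to its UTF-16 code unit value
    }

    public static void main(String[] args) {
        System.out.println((int) ','); // 44
        System.out.println((int) '.'); // 46
        System.out.println((int) ';'); // 59

        String input = "1,5";
        System.out.println(matches(input.charAt(1), ','));      // true
        System.out.println(matches(input.codePointAt(1), '.')); // false
    }
}
```

These are the same values (44, 46, 59) that the C# grammar in the next answer hard-codes.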

score 0 · Answer 2 · answered Mar 16 '16 at 21:55

As a follow-up to Bart's answer, this is the grammar I created with his suggestions:

grammar ExcelScript;



@lexer::header
{
using System;
using System.Globalization;
}

@lexer::members
{
    private Int32 listseparator = 44; // UTF-16 value for comma
    private Int32 decimalseparator = 46; // UTF-16 value for period

    /// <summary>
    /// Creates a new lexer object
    /// </summary>
    /// <param name="input">The input stream</param>
    /// <param name="locale">The locale to use in parsing numbers</param>
    /// <returns>A new lexer object</returns>
    public ExcelScriptLexer (ICharStream input, CultureInfo locale)
    : this(input)
    {
        this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
        this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);

        // Special case: in 8 locales both the list separator and the decimal separator are a comma.
        // Excel uses a semicolon as the list separator in that case, so we do too.
        if (this.listseparator == 44 && this.decimalseparator == 44)
            this.listseparator = 59; // UTF-16 value for semicolon
    }
}


/*
 * Parser Rules
 */

formula
    :   numberLiteral
    |   Identifier
    |   '=' expression
    ;

expression
    :   primary                                         # PrimaryExpression
    |   Identifier arguments                            # FunctionCallExpression
    |   ('+' | '-') expression                          # UnarySignExpression
    |   expression ('*' | '/' | '%') expression         # MulDivModExpression
    |   expression ('+' | '-') expression               # AddSubExpression
    |   expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
    |   expression ('=' | '<>') expression              # EqualCompareExpression
    ;

primary
    :   '(' expression ')'  # ParenExpression
    |   literal             # LiteralExpression
    |   Identifier          # IdentifierExpression
    ;

literal
    :   numberLiteral   # NumberLiteralRule
    |   booleanLiteral  # BooleanLiteralRule
    ;

numberLiteral
    :   IntegerLiteral
    |   FloatingPointLiteral
    ;

booleanLiteral
    :   TrueKeyword
    |   FalseKeyword
    ;

arguments
    :   '(' expressionList? ')'
    ;

expressionList
    :   expression (ListSeparator expression)*
    ;

/*
 * Lexer Rules
 */

AddOperator :   '+' ;
SubOperator :   '-' ;
MulOperator :   '*' ;
DivOperator :   '/' ;
PowOperator :   '^' ;
EqOperator  :   '=' ;
NeqOperator :   '<>' ;
LeOperator  :   '<=' ;
GeOperator  :   '>=' ;
LtOperator  :   '<' ;
GtOperator  :   '>' ;

ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;

TrueKeyword :   [Tt][Rr][Uu][Ee] ;
FalseKeyword    :   [Ff][Aa][Ll][Ss][Ee] ;

Identifier
    :   Letter (Letter | Digit)*
    ;

fragment Letter
    :   [A-Z_a-z]
    ;

fragment Digit
    :   [0-9]
    ;

IntegerLiteral
    :   '0'
    |   [1-9] [0-9]*
    ;

FloatingPointLiteral
    :   [0-9]+ DecimalSeparator [0-9]* Exponent?
    |   DecimalSeparator [0-9]+ Exponent?
    |   [0-9]+ Exponent
    ;

fragment Exponent
    :   ('e' | 'E') ('+' | '-')? ('0'..'9')+
    ;

WhiteSpace
    :   [ \t]+ -> channel(HIDDEN)
    ;
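
For readers following along with the Java demo from the first answer, here is a rough sketch of the same separator setup, including the comma/comma special case from the constructor above. One assumption to flag: as far as I know the Java standard library has no direct counterpart to .NET's `TextInfo.ListSeparator`, so the list separator is passed in by the caller here. `SeparatorResolver`, `Separators` and `resolve` are names made up for this sketch.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class SeparatorResolver {

    /** Holds both separators as int values, the same type LA(1) returns. */
    public static final class Separators {
        public final int list;
        public final int decimal;

        Separators(int list, int decimal) {
            this.list = list;
            this.decimal = decimal;
        }
    }

    // listSeparator is supplied by the caller (assumption: no JDK API provides it).
    public static Separators resolve(char listSeparator, Locale locale) {
        int decimal = DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
        int list = listSeparator;
        // Same special case as the C# constructor: if both are a comma,
        // fall back to a semicolon for the list separator, as Excel does.
        if (list == ',' && decimal == ',') {
            list = ';';
        }
        return new Separators(list, decimal);
    }

    public static void main(String[] args) {
        Separators en = resolve(',', Locale.ENGLISH);
        Separators de = resolve(',', Locale.GERMAN);
        System.out.println((char) en.list + " " + (char) en.decimal); // , .
        System.out.println((char) de.list + " " + (char) de.decimal); // ; ,
    }
}
```

Without the fallback, a formula such as SUM(1,5) would be ambiguous in a comma-decimal locale, because the lexer could not tell an argument separator from a decimal point.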
