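According to the question "Regex - Should hyphens be escaped?", a hyphen inside a character class is treated as a literal character rather than as the range operator when it is the first or last character in the class. That rule might not carry over to ANTLR4's regex-like lexer token definitions; if in doubt, escaping the hyphen as \- inside the set avoids the question.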
Separately, there are a couple of problems with your proposed definition of a COBOL word:
IDENTIFIER : [a-zA-Z0-9]+ ([-_]+ [a-zA-Z0-9]+)*;
A COBOL word has the following rules:
- composed of the characters [A-Za-z0-9_-]
- may not start or end with a - dash
- may not start with an _ underscore
- must contain at least one upper or lower case alpha [A-Za-z]
I see two problems with the proposed definition above:
- It does not allow an underscore as the final character, so a word such as TOTAL_ is rejected.
- It does not require an alpha character, so a word made up entirely of digits, such as 12345, is accepted.
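Both problems can be reproduced by transcribing the proposed rule into an ordinary regular expression. The following is a minimal Python sketch, not ANTLR4 itself; it assumes that Python's `re` character classes behave the same way as the ANTLR4 sets for these simple cases. The names `PROPOSED` and `matches_proposed` are invented for this illustration.

```python
import re

# Hypothetical transcription of the proposed ANTLR4 rule into a Python regex:
#   IDENTIFIER : [a-zA-Z0-9]+ ([-_]+ [a-zA-Z0-9]+)*;
PROPOSED = re.compile(r"[a-zA-Z0-9]+(?:[-_]+[a-zA-Z0-9]+)*")


def matches_proposed(word: str) -> bool:
    """Return True if the whole word matches the proposed definition."""
    return PROPOSED.fullmatch(word) is not None


print(matches_proposed("CUSTOMER-NAME"))  # valid word, accepted
print(matches_proposed("TOTAL_"))         # valid per the rules, yet rejected
print(matches_proposed("12345"))          # no alpha character, yet accepted
```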
I suggest the following ANTLR4 lexer definition for a COBOL word:
IDENTIFIER : ([0-9][0-9_-]*)? [A-Za-z] ([A-Za-z0-9_-]*[A-Za-z0-9_])? ;
// IBM Enterprise COBOL Language Reference V4.2
// Enterprise COBOL for z/OS
// Language Reference
// Version 4 Release 2
// SC23-8528-01
// Second Edition (August 2009)
// Page 9
// PDF page 31
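As a sanity check, the suggested rule can be compared against the four word rules listed earlier. This is a sketch in Python rather than ANTLR4. The regex is a transcription that assumes `*` repetition after the `[0-9_-]` and `[A-Za-z0-9_-]` sets, so that words of any length can match, and it assumes Python's `re` treats a trailing hyphen in a class as a literal. `SUGGESTED`, `is_cobol_word`, and `agrees_on_small_words` are names made up for this example.

```python
import re
import string
from itertools import product

# Transcription of the suggested lexer rule into a Python regex
# (assumption: the two inner sets repeat with '*').
SUGGESTED = re.compile(
    r"(?:[0-9][0-9_-]*)?[A-Za-z](?:[A-Za-z0-9_-]*[A-Za-z0-9_])?"
)

ALLOWED = set(string.ascii_letters + string.digits + "_-")


def is_cobol_word(word: str) -> bool:
    """Check the four word rules directly, without a regex."""
    return (
        len(word) > 0
        and all(ch in ALLOWED for ch in word)            # allowed characters only
        and not word.startswith(("-", "_"))              # no leading dash or underscore
        and not word.endswith("-")                       # no trailing dash
        and any(ch in string.ascii_letters for ch in word)  # at least one alpha
    )


def agrees_on_small_words(max_len: int = 5) -> bool:
    """Compare regex and rules on every word over a tiny alphabet."""
    for n in range(1, max_len + 1):
        for chars in product("A9_-", repeat=n):
            word = "".join(chars)
            by_regex = SUGGESTED.fullmatch(word) is not None
            if by_regex != is_cobol_word(word):
                return False
    return True


print(agrees_on_small_words())
```

The exhaustive comparison only covers short words built from one letter, one digit, the underscore, and the dash, but those four characters stand in for every character class the rules distinguish.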