How can I select a specific word with a regular expression in JLex?

Question

I want to select the word "String" from the line "String helloString String Stringhello helloStringhello".

Only the two standalone occurrences of "String" (the first word and the middle one) should be selected.

The "String" inside "helloString", "Stringhello", or "helloStringhello" shouldn't be selected.

This is my RE:

<YYINITIAL> (String) {return new Token(TokenType.Type_String,yytext());}

But it selects every occurrence of "String".

My JLex code:

import java.io.*;
enum TokenType {Type_String,Identifier}
class Token{
  String text;
  TokenType type;
  Token(TokenType type,String text)
  {
    this.text=text;
    this.type=type;
  }

  public String toString()
  {
    return String.format("[%s,%s]",type,text);
  }
}
%%
%class Lexer
%public
%function getNextToken
%type Token
%{
     public static void main(String[] args) throws IOException {
        FileReader r = new FileReader("in.txt");
        Lexer l = new Lexer(r);
        Token tok;
        while((tok=l.getNextToken())!=null){
            System.out.println(tok);
        } 
        r.close();
    }
%}
%line
%char
SPACE=[\r\t\n\f\ ]
ALPHA=[a-zA-Z]
DIGIT=[0-9]
ID=({ALPHA}|_)({ALPHA}|{DIGIT}|_)*



%%
<YYINITIAL> {ID} {return new Token(TokenType.Identifier,yytext());}
<YYINITIAL> (String) {return new Token(TokenType.Type_String,yytext());}
<YYINITIAL> {SPACE}* {}
<YYINITIAL> . {System.out.println("error - "+yytext());}

What are your other rules? Specifically, do you have an identifier-like rule that can match `helloString` etc.? Because then it should just work as you want. — sepp2k, Nov 29 '20 at 14:39
`ALPHA=[a-zA-Z]` `DIGIT=[0-9]` `ID=({ALPHA}|_)({ALPHA}|{DIGIT}|_)*` ` {ID} {return new Token(TokenType.Identifier,yytext());}` — Adham Mostafa, Nov 29 '20 at 14:40
That looks like it should work fine. Please post a [MCVE] that I can play around with. — sepp2k, Nov 29 '20 at 15:00

score 0 · Answer 1 · answered Nov 29 '20 at 15:57

If I run your code on your example input, I don't see the behaviour you describe. The words helloString etc. aren't recognized as tokens of type Type_String, but as tokens of type Identifier, which I assume is the intended behaviour. So that part is actually working fine.

What isn't working fine is that String by itself is also recognized as an identifier. The reason for that is that if two rules can produce a match of the same length, the rule that comes first is chosen. You've defined the rule for identifiers before the rule for the string keyword, so that's why it's always chosen. If you switch the two rules around, String by itself will be recognized as Type_String and everything else will be recognized as an identifier.
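
The tie-breaking behaviour described above can be sketched in plain Java. This is a minimal simulation built on `java.util.regex`, not JLex itself: the class `RulePriorityDemo` and its methods `rules` and `pick` are hypothetical names made up for illustration, and the identifier pattern is a hand translation of the `ID` macro from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RulePriorityDemo {
    // Builds the two rules in the requested order (hypothetical helper).
    // LinkedHashMap keeps insertion order, which stands in for rule order in the spec.
    static Map<String, Pattern> rules(boolean identifierFirst) {
        Pattern id = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"); // translation of the ID macro
        Pattern str = Pattern.compile("String");
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (identifierFirst) {
            rules.put("Identifier", id);
            rules.put("Type_String", str);
        } else {
            rules.put("Type_String", str);
            rules.put("Identifier", id);
        }
        return rules;
    }

    // Picks the rule with the longest match at the start of the input.
    // The strict '>' means that on a tie the earlier rule is kept.
    static String pick(Map<String, Pattern> rules, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                bestLength = m.end();
                best = rule.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Identifier rule first: "String" ties at length 6, so Identifier wins.
        System.out.println(pick(rules(true), "String"));       // Identifier
        // Keyword rule first: the tie now goes to Type_String.
        System.out.println(pick(rules(false), "String"));      // Type_String
        // A longer identifier match still beats the keyword, whatever the order.
        System.out.println(pick(rules(false), "Stringhello")); // Identifier
        System.out.println(pick(rules(false), "helloString")); // Identifier
    }
}
```

The last two calls show why putting the keyword rule first does not break words such as `Stringhello`: the identifier rule matches more characters, and the longer match is preferred before rule order is considered.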

sepp2k

Yes, you are right, it works, but my professor asked me how to do it without switching the two rules. The task is to define the data types `(int,double,String)`. I did that, but if the input is `(inthello)`, the output gives me `(int)=datatype` and `(hello)=identifier`, so I want to match standalone tokens, not parts of a word. – Adham Mostafa Nov 29 '20 at 16:08
@EngLeviAckerman I'm not sure I'm following you. Are you saying your professor specifically said that the `datatype` rule should be defined after the `identifier` rule and the behaviour should still be as if it were the other way around? I don't think that's possible. – sepp2k Nov 29 '20 at 16:13
Yes, he said that, and I have spent two days trying to solve it but can't. – Adham Mostafa Nov 29 '20 at 16:16
@sepp2k I have added my code below, please check it. – Adham Mostafa Nov 29 '20 at 16:17
@EngLeviAckerman Are you sure you didn't misunderstand what the professor meant? As far as I can see defining the type name rules before the identifier rules is the only sane way of getting the proper behaviour. – sepp2k Nov 29 '20 at 16:24

score 0 · Answer 2 · answered Nov 29 '20 at 16:12

This is my second JLex code:

import java.io.*;
enum TokenType {OutPut_Instruction,Quoted_Stentence,Semi,L_Pracet,R_Pracet,Type_int,Type_double,Type_String,Identifier}
class Token{
  String text;
  TokenType type;
  Token(TokenType type,String text)
  {
    this.text=text;
    this.type=type;
  }

  public String toString()
  {
    return String.format("[%s,%s]",type,text);
  }
}
%%
%class Lexer
%public
%function getNextToken
%type Token
%{
     public static void main(String[] args) throws IOException {
        FileReader r = new FileReader("in.txt");
        Lexer l = new Lexer(r);
        Token tok;
        while((tok=l.getNextToken())!=null){
            System.out.println(tok);
        } 
        r.close();
    }
%}
%line
%char
SPACE=[\r\t\n\f\ ]
SEMI_COLO=[;]
L_P=[(]
R_P=[)]
DOUBLE_COT="\""([^\n\"]*(\\[.])*)*"\""
PRINT=(Print)
ALPHA=[a-zA-Z]
DIGIT=[0-9]
INT=(int)
DOUBLE=(double)
STRING=(String)
TYPE=(int)|(double)|(String)
ID=({ALPHA}|_)({ALPHA}|{DIGIT}|_)*



%%
<YYINITIAL> {L_P} {return new Token(TokenType.L_Pracet,yytext());}
<YYINITIAL> {R_P} {return new Token(TokenType.R_Pracet,yytext());}
<YYINITIAL> {SEMI_COLO} {return new Token(TokenType.Semi,yytext());}
<YYINITIAL> {PRINT} {return new Token(TokenType.OutPut_Instruction,yytext());}
<YYINITIAL> [^{TYPE}\ ]{ID} {return new Token(TokenType.Identifier,yytext());}
<YYINITIAL> {INT} {return new Token(TokenType.Type_int,yytext());}
<YYINITIAL> {DOUBLE} {return new Token(TokenType.Type_double,yytext());}
<YYINITIAL> {STRING} {return new Token(TokenType.Type_String,yytext());}
<YYINITIAL> {DOUBLE_COT} {return new Token(TokenType.Quoted_Stentence,yytext());}
<YYINITIAL> {SPACE}* {}
<YYINITIAL> . {System.out.println("error - "+yytext());}

This is the input:

> ah String ah Stringahmredgah Sahmed String int

This is the output:

[Identifier,ah]
[Type_String,String]
[Identifier,ah]
[Type_String,String]
[Identifier,ahmredgah]
error - S
[Identifier,ahmed]
[Type_String,String]
[Type_int,int]

A negated character class matches exactly one character that's not inside the brackets. `[^{TYPE}\ ]` expands to `[^(int)|(double)|(String)\ ]` and will match any character that's not a parenthesis, not a `|`, not a space, and not an `i`, an `n` or any of the other letters that appear inside the brackets. So what this character class does is disallow any identifier that starts with an `S`, an `i` etc. (it also requires any identifier to consist of at least two characters). That's clearly not what you want, and there is no easy way to do what you want here: JLex doesn't have a way to negate a regex. – sepp2k, Nov 29 '20 at 16:29
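
The point about negated character classes can be checked with a small Java sketch. It uses `java.util.regex` as a stand-in, on the assumption that the bracket contents are read as individual characters in the same way the comment describes for JLex; the class `NegatedClassDemo` and the method `matchesOneChar` are hypothetical names for illustration only.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Java analogue of [^{TYPE}\ ] after macro expansion (assumed equivalent).
    // Inside the brackets, '(', ')' and '|' are ordinary characters, not operators.
    static final Pattern NOT_TYPE_CHAR =
            Pattern.compile("[^(int)|(double)|(String) ]");

    // True only if the whole input is a single character outside the listed set.
    static boolean matchesOneChar(String s) {
        return NOT_TYPE_CHAR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesOneChar("a"));  // true: 'a' is not listed
        System.out.println(matchesOneChar("S"));  // false: 'S' comes from "String"
        System.out.println(matchesOneChar("i"));  // false: 'i' comes from "int"
        System.out.println(matchesOneChar("ab")); // false: the class matches one character only
    }
}
```

This matches the output posted above: `Sahmed` cannot start an identifier because of its leading `S`, so the `S` falls through to the error rule and the rest is lexed as `ahmed`.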
