
I want to select the word "String" from the line "String helloString String Stringhello helloStringhello".

Only the two standalone occurrences of "String" (the first word and the middle one) should be selected.

"String" in "helloString" or "Stringhello" or "helloStringhello" shouldn't be selected.

This is my RE:

<YYINITIAL> (String) {return new Token(TokenType.Type_String,yytext());}

But it selects every occurrence of "String".

My JLex code:

import java.io.*;
enum TokenType {Type_String,Identifier}
class Token{
  String text;
  TokenType type;
  Token(TokenType type,String text)
  {
    this.text=text;
    this.type=type;
  }

  public String toString()
  {
    return String.format("[%s,%s]",type,text);
  }
}
%%
%class Lexer
%public
%function getNextToken
%type Token
%{
     public static void main(String[] args) throws IOException {
        FileReader r = new FileReader("in.txt");
        Lexer l = new Lexer(r);
        Token tok;
        while((tok=l.getNextToken())!=null){
            System.out.println(tok);
        } 
        r.close();
    }
%}
%line
%char
SPACE=[\r\t\n\f\ ]
ALPHA=[a-zA-Z]
DIGIT=[0-9]
ID=({ALPHA}|_)({ALPHA}|{DIGIT}|_)*



%%
<YYINITIAL> {ID} {return new Token(TokenType.Identifier,yytext());}
<YYINITIAL> (String) {return new Token(TokenType.Type_String,yytext());}
<YYINITIAL> {SPACE}* {}
<YYINITIAL> . {System.out.println("error - "+yytext());}
Guy Coder
Adham Mostafa

2 Answers


If I run your code on your example input, I don't see the behaviour you describe. The words helloString etc. aren't recognized as tokens of type Type_String, but as tokens of type Identifier, which I assume is the intended behaviour. So that part is actually working fine.

What isn't working fine is that String by itself is also recognized as an identifier. The reason for that is that if two rules can produce a match of the same length, the rule that comes first is chosen. You've defined the rule for identifiers before the rule for the string keyword, so that's why it's always chosen. If you switch the two rules around, String by itself will be recognized as Type_String and everything else will be recognized as an identifier.
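
The tie-breaking can be illustrated outside JLex. The following is only a rough sketch in plain Java, not JLex's actual implementation: it assumes `java.util.regex` is a close enough stand-in for these two simple patterns, and the names `RulePriorityDemo`, `Rule` and `tokenize` are made up for the illustration. The scanner takes the longest match at each position and, when two rules match the same length, keeps the rule that appears earlier in the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RulePriorityDemo {
    /** One lexer rule: a token name plus the pattern it matches (hypothetical helper). */
    static final class Rule {
        final String name;
        final Pattern pattern;

        Rule(String name, String regex) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Same shapes as the ID and (String) rules in the question.
    static final Rule ID = new Rule("Identifier", "[a-zA-Z_][a-zA-Z0-9_]*");
    static final Rule STRING = new Rule("Type_String", "String");

    /** Longest match wins; on a tie the rule listed first wins. */
    static List<String> tokenize(String input, List<Rule> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // skip whitespace, like the {SPACE}* rule
                continue;
            }
            Rule best = null;
            int bestLen = 0;
            for (Rule rule : rules) {
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = rule; // strictly longer only, so earlier rules keep ties
                    bestLen = m.end() - pos;
                }
            }
            if (best == null) {
                tokens.add("[error," + input.charAt(pos) + "]");
                pos++;
            } else {
                tokens.add("[" + best.name + "," + input.substring(pos, pos + bestLen) + "]");
                pos += bestLen;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "String helloString String Stringhello helloStringhello";
        // Identifier rule first: every word, including "String", comes out as Identifier.
        System.out.println(tokenize(line, Arrays.asList(ID, STRING)));
        // Keyword rule first: only the standalone "String" words become Type_String.
        System.out.println(tokenize(line, Arrays.asList(STRING, ID)));
    }
}
```

Note that `Stringhello` stays an identifier in both orderings, because the identifier pattern produces the longer match there and rule order only matters for ties.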

sepp2k
  • Yes, you are right, it works. But my professor asked me how to do it without switching the two rules. The task is to define the data types `(int,double,String)`. I did that, but if the input is `(inthello)`, the output gives me `(int)=datatype` and `(hello)=identifier`, so I want to select standalone tokens only, not parts of a word. – Adham Mostafa Nov 29 '20 at 16:08
  • @EngLeviAckerman I'm not sure I'm following you. Are you saying your professor specifically said that the `datatype` rule should be defined after the `identifier` rule and the behaviour should still be as if it were the other way around? I don't think that's possible. – sepp2k Nov 29 '20 at 16:13
  • Yes, he said that, and I have spent 2 days trying to solve it but can't. – Adham Mostafa Nov 29 '20 at 16:16
  • @sepp2k I have added my code below, please check it. – Adham Mostafa Nov 29 '20 at 16:17
  • @EngLeviAckerman Are you sure you didn't misunderstand what the professor meant? As far as I can see defining the type name rules before the identifier rules is the only sane way of getting the proper behaviour. – sepp2k Nov 29 '20 at 16:24

This is my second JLex code:

import java.io.*;
enum TokenType {OutPut_Instruction,Quoted_Stentence,Semi,L_Pracet,R_Pracet,Type_int,Type_double,Type_String,Identifier}
class Token{
  String text;
  TokenType type;
  Token(TokenType type,String text)
  {
    this.text=text;
    this.type=type;
  }

  public String toString()
  {
    return String.format("[%s,%s]",type,text);
  }
}
%%
%class Lexer
%public
%function getNextToken
%type Token
%{
     public static void main(String[] args) throws IOException {
        FileReader r = new FileReader("in.txt");
        Lexer l = new Lexer(r);
        Token tok;
        while((tok=l.getNextToken())!=null){
            System.out.println(tok);
        } 
        r.close();
    }
%}
%line
%char
SPACE=[\r\t\n\f\ ]
SEMI_COLO=[;]
L_P=[(]
R_P=[)]
DOUBLE_COT="\""([^\n\"]*(\\[.])*)*"\""
PRINT=(Print)
ALPHA=[a-zA-Z]
DIGIT=[0-9]
INT=(int)
DOUBLE=(double)
STRING=(String)
TYPE=(int)|(double)|(String)
ID=({ALPHA}|_)({ALPHA}|{DIGIT}|_)*



%%
<YYINITIAL> {L_P} {return new Token(TokenType.L_Pracet,yytext());}
<YYINITIAL> {R_P} {return new Token(TokenType.R_Pracet,yytext());}
<YYINITIAL> {SEMI_COLO} {return new Token(TokenType.Semi,yytext());}
<YYINITIAL> {PRINT} {return new Token(TokenType.OutPut_Instruction,yytext());}
<YYINITIAL> [^{TYPE}\ ]{ID} {return new Token(TokenType.Identifier,yytext());}
<YYINITIAL> {INT} {return new Token(TokenType.Type_int,yytext());}
<YYINITIAL> {DOUBLE} {return new Token(TokenType.Type_double,yytext());}
<YYINITIAL> {STRING} {return new Token(TokenType.Type_String,yytext());}
<YYINITIAL> {DOUBLE_COT} {return new Token(TokenType.Quoted_Stentence,yytext());}
<YYINITIAL> {SPACE}* {}
<YYINITIAL> . {System.out.println("error - "+yytext());}

This is the input:

> ah String ah Stringahmredgah Sahmed String int

This is the output:

[Identifier,ah]
[Type_String,String]
[Identifier,ah]
[Type_String,String]
[Identifier,ahmredgah]
error - S
[Identifier,ahmed]
[Type_String,String]
[Type_int,int]
Adham Mostafa
  • A negated character class matches exactly one character that's not inside the brackets. `[^{TYPE}\ ]` expands to `[^(int)|(double)|(String)\ ]` and will match any character that's not a parenthesis, not a `|`, not a space and not an `i`, an `n` or any of the other letters that appear inside the brackets. So what this character class does is disallow any identifier that starts with an `S`, an `i` etc. (it also requires any identifier to consist of at least two characters). That's clearly not what you want, and there is no easy way to do what you want here: JLex doesn't have a way to negate a regex. – sepp2k Nov 29 '20 at 16:29
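
The effect of that character class can be checked with a small Java program. This is only an approximation under the assumption that `java.util.regex` treats this particular bracket expression the same way JLex does; the class name `NegatedClassDemo` and its methods are invented for the sketch.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Java-regex stand-in for the expanded JLex class [^(int)|(double)|(String)\ ].
    // Inside brackets, the parentheses and '|' are ordinary characters.
    static final Pattern FIRST_CHAR = Pattern.compile("[^(int)|(double)|(String) ]");

    // Stand-in for the whole rule [^{TYPE}\ ]{ID}: one allowed character, then a full ID.
    static final Pattern BROKEN_ID =
            Pattern.compile("[^(int)|(double)|(String) ][a-zA-Z_][a-zA-Z0-9_]*");

    /** True if c is accepted by the negated class, i.e. may start an "identifier". */
    static boolean firstCharAllowed(char c) {
        return FIRST_CHAR.matcher(String.valueOf(c)).matches();
    }

    /** True if the whole word matches the identifier rule as written in the question. */
    static boolean matchesBrokenRule(String word) {
        return BROKEN_ID.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (char c : new char[] {'a', 'h', 'S', 'i', 'd'}) {
            System.out.println("'" + c + "' allowed as first character: " + firstCharAllowed(c));
        }
        for (String w : new String[] {"ah", "Sahmed", "a"}) {
            System.out.println(w + " matches the rule: " + matchesBrokenRule(w));
        }
    }
}
```

Under these assumptions `Sahmed` is rejected because of its leading `S`, which lines up with the `error - S` line in the output above, and a one-letter name such as `a` is rejected because the rule needs at least two characters.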