How to detect a partial (unfinished) token and join its pieces that are obtained from two consecutive portions of input?

Question

I am writing a toy terminal in which I use Flex to parse the normal text and control sequences that I receive from a tty. One detail of the Cocoa machinery is that it reads from the tty in chunks of 1024 bytes, so any token described in my .lex file can end up split in two: some of its bytes are the last bytes of one 1024-byte chunk, and the remaining bytes are the very first bytes of the next chunk.

So I need to somehow:

Detect the situation in which a token is split between two 1024-byte chunks.
Remember the first part of the token.
When the second 1024-byte chunk arrives, restore that first part by putting it in front of the second chunk.

I am completely new to Flex, so I am looking for the right way to accomplish this.

I have created a very simple lexer to assist this discussion.

My question about this demo is:

How can I detect that the last "FO" (an unfinished "FOO") is actually an unfinished token, that is, that it is not an exception to my grammar but just needs its "O" from the next chunk of input?

I found an unanswered thread on the Flex mailing list asking exactly the same thing: https://sourceforge.net/p/flex/mailman/message/29716589/ – Stanislav Pankevich, Mar 27 '16 at 13:32

rici · Accepted Answer · 2017-11-17T18:14:05.160

1

You should let flex do the reading. It is designed to work that way; it will do all the buffering necessary, including the case where a token is split between two (or more) input buffers.

If you cannot simply read from stdin using the standard fread function, you can change the way the flex-generated scanner gets its input by redefining the macro YY_INPUT. See the "Generated Scanner" chapter of the flex manual for a description of this macro.
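As a rough illustration of the kind of reader YY_INPUT could delegate to, here is a minimal sketch in C, assuming the tty is available as a plain file descriptor. The function name `read_chunk_for_lexer` and the `tty_fd` variable mentioned in the comment are hypothetical; they are not part of flex or of the question's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper for a custom YY_INPUT.
 * In the %{ ... %} block of the .l file one could then write (sketch):
 *   #define YY_INPUT(buf, result, max_size) ((result) = read_chunk_for_lexer(tty_fd, (buf), (max_size)))
 * where tty_fd is an assumed global holding the open descriptor. */
static int read_chunk_for_lexer(int fd, char *buf, size_t max_size)
{
    ssize_t n;

    assert(buf != NULL && max_size > 0);

    /* Retry if the read was interrupted by a signal. */
    do {
        n = read(fd, buf, max_size);
    } while (n < 0 && errno == EINTR);

    /* A positive count means "here are n more bytes"; flex keeps any
     * unfinished token in its own buffer and asks again.
     * Zero means end of input. This sketch also treats errors as end of input. */
    return n > 0 ? (int)n : 0;
}
```

The point of the design is that the function never has to know where tokens begin or end: it only reports how many bytes it delivered, and returns zero once there is really nothing more to read.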

edited Nov 17 '17 at 18:14

answered Mar 27 '16 at 02:30

rici


Thanks for the answer. I have tried following the direction you pointed me in, but before I even get to a custom modification of `YY_INPUT` or `yywrap()` I still need to somehow recognize the broken token (step 1 of my question). Currently the broken token is treated as an exception to my grammar. I have created a dead simple lexer to assist this discussion (see my question). – Stanislav Pankevich Mar 27 '16 at 12:46
I didn't have enough background to understand your answer the first time I read it. I have now tried a custom YY_INPUT in my example project, and I see that Flex handles partial tokens correctly. The trick is to not tell it "we are done" every time. – Stanislav Pankevich Mar 27 '16 at 17:35

score 0 · Answer 2 · edited May 23 '17 at 12:15


I have accepted @rici's answer as the correct one because it gave me an important hint about redefining the macro YY_INPUT.

In this answer I just want to share some details for newbies like me.

I used the question "How to make YY_INPUT point to a string rather than stdin in Lex & Yacc (Solaris)" as an example of a custom YY_INPUT, and this made my artificial example work correctly with partial tokens.

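To make Flex work correctly with partial tokens, YY_INPUT must keep supplying input instead of reporting end of input after every chunk, and the byte count it reports must not include the terminating '\0' that strcpy writes. In that sense the scanning process is "endless". Here is the function that the redefined YY_INPUT calls: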

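#include <string.h>

/* Alternates between two chunks so that the token "FOO" is split across
   the chunk boundary. maxBytesToRead is ignored: this demo assumes both
   chunks fit into the buffer that Flex provides. */
int readInputForLexer(char *buffer, int *numBytesRead, int maxBytesToRead) {
    static int Flip = 0;

    (void)maxBytesToRead;

    if ((Flip++ % 2) == 0) {
        strcpy(buffer, "FOO F");

        *numBytesRead = 5; // IMPORTANT: this is 5, not 6, to cut off \0
    } else {
        strcpy(buffer, "OO FOO");
        *numBytesRead = 6; // IMPORTANT: this is 6, not 7, to cut off \0
    }

    return 0;
}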

In this example Flex glues the partial token F-OO into a correct one: FOO.

As @rici pointed out in his comment, the correct way to stop scanning is to set *numBytesRead = 0.
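The same idea can be sketched a little more generally. The following is a minimal illustration in C, not the code from my example project: a hypothetical `chunk_source` hands out queued chunks, respects the maximum size requested, never counts the '\0', and returns 0 only when every chunk has been consumed. All names in it are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source standing in for chunks arriving from the tty. */
struct chunk_source {
    const char *const *chunks; /* C strings, used here only for convenience */
    size_t count;              /* number of chunks */
    size_t index;              /* chunk currently being consumed */
    size_t offset;             /* bytes of that chunk already handed out */
};

/* Copies at most max_size bytes of the current chunk into buf.
 * Returns the number of bytes copied, or 0 once every chunk is consumed,
 * which is how a YY_INPUT replacement reports end of input. */
static int chunk_source_read(struct chunk_source *src, char *buf, size_t max_size)
{
    assert(src != NULL && buf != NULL && max_size > 0);

    while (src->index < src->count) {
        size_t len = strlen(src->chunks[src->index]); /* excludes the '\0' */
        size_t left = len - src->offset;

        if (left == 0) {       /* chunk exhausted: move on to the next one */
            src->index++;
            src->offset = 0;
            continue;
        }

        size_t n = left < max_size ? left : max_size;
        memcpy(buf, src->chunks[src->index] + src->offset, n);
        src->offset += n;
        return (int)n;
    }

    return 0; /* nothing left: end of input */
}
```

With chunks "FOO F" and "OO FOO", the bytes reach the scanner as one continuous stream, so the split token needs no special handling on the caller's side. In a .l file the function could be wired in with something like `#define YY_INPUT(buf, result, max_size) ((result) = chunk_source_read(&source, (buf), (max_size)))`, where `source` is an assumed global.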

See also another answer by @rici on a similar SO question: "Flex, continuous scanning stream (from socket). Did I miss something using yywrap()?".

See my example for further details.


answered Mar 27 '16 at 17:33

Stanislav Pankevich


That's not really correct. Flex doesn't care whether the string contains NUL bytes; it considers a NUL byte to be just another character. Of course, you should not pass a NUL byte which is not intended to be part of the input stream. To indicate EOF, YY_INPUT must indicate that zero bytes have been read by returning a length of zero. – rici Mar 27 '16 at 18:29
Thanks for the tip about returning a length of zero! Still, if I return `*numBytesRead = 6` instead of 5 (first case), Flex starts complaining about a broken token. So my observation is purely empirical: I must not include `\0` bytes in the returned strings if I want Flex to come back for more input (otherwise it doesn't come back, even if numBytesRead != 0). – Stanislav Pankevich Mar 27 '16 at 19:19
If you return length of 6 then flex will try to scan the NUL byte as part of the input stream. That will probably cause a parse error with the result that yylex is not called again. But it is not a signal to flex to stop scanning. – rici Mar 27 '16 at 19:41
I have updated my answer by adding your comment and removing my misleading remarks. By the way, setting a length which includes '\0' indeed causes a parser error: `flex scanner jammed`. Could you shed light on why that happens? – Stanislav Pankevich Mar 27 '16 at 20:06
I think I understand! That's because NUL is a separate character and I don't have any token corresponding to it. So it is just a regular parsing error, correct? Thanks. – Stanislav Pankevich Mar 27 '16 at 20:11
Sounds right. Normally, you get a scanner jammed error if there is no applicable rule, and you have disabled the default rule. (And you ignore the warning when you run flex.) – rici Mar 27 '16 at 22:49
