How to deal with different newline conventions with Flex

Question

We're trying to parse the GEDCOM file format. According to its standard, all combinations of CR and LF are valid to denote a newline break.

It's a line-based format, so we often want to match the rest of the line once we've matched the number and the tag. An example of such a rule would be

"NAME ".+               { /* deal with the name */ }

The newlines are matched by

[\r\n]+             {return ENDLINE;}

This works fine on Windows, because text-mode input there converts \r\n to \n behind your back, but it doesn't work on Linux. There, the dot can match \r. Because Flex picks the longest match, it will either include the \r in the data, or it will match a known tag with the UNKNOWNTAG rule, because that technically correct match is one byte longer.

A solution could be to replace all dots with [^\r\n], but that seems inelegant. Is there a better way?

There is only a flag that switches between `. = [^\n]` and `. = «any char»`. I think the best you could do is define a pattern `ANY = [^\r\n]` and just use `"NAME " {ANY}+`. — kennytm, Apr 01 '17 at 20:49
I agree with @kennytm. Alternatively, normalise the GEDCOM files before lexing (`sed -ri 's/\r$//' *.gedcom`), or convert on the fly while reading the lexer input: `int main(void){ yyin = popen("sed -r 's/\\r$//'", "r"); return yylex(); }` – JJoao, Apr 02 '17 at 09:53
I took a look at the GEDCOM format. It looks like you need a parser such as bison to handle this. If you use one, you can define separate tokens and use `(\r\n|\n)` as EOL. Then there is no 'match the rest of the line', and this is no longer a problem :) – komar, Apr 02 '17 at 22:49
BTW, maybe this will be useful to you: http://gedcom-parse.sourceforge.net/ (the GEDCOM parser library) – komar, Apr 02 '17 at 22:51
Thanks, I think we'll continue using [^\r\n] to replace the dot. It's probably the neatest solution. We're not very fond of preprocessing. I've looked into using a complete parser library, but programming it ourselves is a more interesting learning experience :) — Kasper, Apr 06 '17 at 22:23
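
The named-definition approach from the comments might look like the sketch below. Flex definitions are written as `NAME pattern` with no `=`, and the reference must sit directly against the quoted string, because an unquoted space ends the pattern. The action body and the ENDLINE token are taken from the question; the layout of the rest of the scanner is an assumption.

```lex
ANY     [^\r\n]

%%

"NAME "{ANY}+   { /* deal with the name */ }
[\r\n]+         { return ENDLINE; }
```

With this definition, a trailing \r is never swallowed by the rest-of-line match, so the longest-match rule no longer favours a catch-all rule by one byte.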

score 0 · Answer 1 · answered Apr 02 '17 at 22:44


You can redefine YY_INPUT to replace \r with \n:

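%{
/* Feed flex one byte at a time, turning every \r into \n. */
#define YY_INPUT(buf,result,max_size) {      \
        int c = getc(yyin);                  \
        if (c == EOF) {                      \
            result = YY_NULL;                \
        } else {                             \
            buf[0] = (c == '\r') ? '\n' : c; \
            result = 1;                      \
        }                                    \
    }
%}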

komar


Forcing flex to read one character at a time is a really bad idea. – rici Apr 03 '17 at 18:14
@rici How can I replace all occurrences of something with something else before the input reaches flex? For example, `B\65R` should match the `BAR` case in flex (`\65` -> `A`). – Sourav Kannantha B Dec 24 '21 at 19:00
@sourav: That comment should probably be a [question](https://stackoverflow.com/questions/ask). – rici Dec 24 '21 at 23:04
Thanks, I have posted it now, [here](https://stackoverflow.com/questions/70478080/how-to-replace-some-characters-of-input-file-before-it-getting-lexed-in-flex). – Sourav Kannantha B Dec 25 '21 at 05:08
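
One way to keep the answer's \r to \n substitution without the one-character reads rici objects to is to fill the whole buffer flex offers and rewrite it in place. The following is only a sketch: `read_normalized` is a hypothetical helper name, not part of flex, and it assumes the input is a regular file (for interactive input, `fread` would wait for a full buffer).

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of flex): read up to max_size bytes from fp
 * into buf and rewrite every '\r' as '\n'.
 * Returns the number of bytes stored, or 0 at end of file. */
size_t read_normalized(FILE *fp, char *buf, size_t max_size)
{
    size_t n = fread(buf, 1, max_size, fp);
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\r')
            buf[i] = '\n';
    }
    return n;
}

/* Assumed hookup in the definitions section of the .l file:
 *
 *   #define YY_INPUT(buf, result, max_size) \
 *       { result = read_normalized(yyin, buf, max_size); }
 *
 * A return value of 0 matches YY_NULL, which tells flex the input is done.
 */
```

As with the original macro, \r\n becomes \n\n rather than a single \n, so this relies on the scanner treating a run of newlines as one ENDLINE, as the `[\r\n]+` rule in the question does.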
