UTF-8 with lex (flex)

Question

I have a lexer specified with the following definitions:

ws      [ \t\n]+
punc            (\.|\,|\!|\?)
word        ({punc}|[a-zA-Z0-9])*
special         (\%|\_|\&|\$|\#)

I have some UTF-8 files that I need to parse, and naturally the lexer blows up when it reaches non-ASCII characters. I know that similar questions have been asked a few times in the past, but none of them helped. I tried the approach given in this answer, but failed. I guess the problem is in the definition of word above?

It would be really helpful if someone could give details on the general concept of using UTF-8 encoding with flex.

Yes, that is because Adobe chose a name for its product which was already in use (since 1992, IIRC) – wildplasser, Dec 09 '12 at 23:00

score 2 · Accepted Answer · edited May 23 '17 at 12:29

Try this (process with flex -8):

/* Definitions section: name definitions must come before the first %% */
ws      [ \t\n]+
punc            (\.|\,|\!|\?)
word        ({punc}|[a-zA-Z0-9\x80-\xf3])*
special         (\%|\_|\&|\$|\#)

%%

(The coding is a bit coarse-grained ...) The link mentioned by the OP, leading to Kaz's answer, is much more exact with respect to the allowed sequences.
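To make "more exact" concrete, here is a minimal sketch in C of a check that accepts only well-formed sequences, following the table of well-formed UTF-8 byte sequences in the Unicode Standard (Table 3-7). The function name `utf8_seq_len` is hypothetical, not part of flex; the same byte ranges could equally be written as flex definitions.

```c
#include <stddef.h>

/* Hypothetical helper (not a flex API): returns the length in bytes (1-4)
 * of the well-formed UTF-8 sequence that starts at s, or 0 if the bytes at
 * s do not begin a well-formed sequence. n is the number of bytes available. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for the 2nd byte */

    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;                         /* plain ASCII */

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* rejects overlong 3-byte forms */
        if (s[0] == 0xED) hi = 0x9F;      /* rejects UTF-16 surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* rejects overlong 4-byte forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* rejects values above U+10FFFF */
    } else {
        return 0;                         /* 0x80-0xC1 and 0xF5-0xFF cannot lead */
    }

    if (n < len)
        return 0;                         /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)             /* remaining continuation bytes */
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}
```

One possible use, assuming you keep the coarse `\x80-\xf3` class in the pattern: walk `yytext` inside the rule's action with this function and reject the token if it ever returns 0.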

answered Dec 09 '12 at 17:50

wildplasser

I examined the output again and found that it actually produced UTF-8 output in the first place (I was misled because my terminal does not support UTF-8). However, Unicode characters were treated as separate words. Changing ``word``'s definition as you suggested solved the problem. BTW, is there a difference between xf3 and xf4 being the upper limit? xf4 is reserved for private use, and xf5+ are invalid, right? – osolmaz Dec 09 '12 at 22:54
My flex syntax was a little rusty; I actually typed this in as a guess, but flex *does* appear to have full 8-bit support. Not all bytes (and sequences) above 0x7f are valid UTF-8, so you might want to be more restrictive in what you accept. – wildplasser Dec 09 '12 at 22:59
Look at the answer I linked in the question. I think they accomplished that by trimming the invalid intervals? So it would be ``\x80-\xbf, \xc2-\xdf, \xe0-\xef, \xf0-\xf4``. You may edit your answer if you like. – osolmaz Dec 09 '12 at 23:06
Kaz has the right answer wrt characters that are possible. It still allows illegal *sequences*, such as 0xc0 plus *two or more* 0x8x bytes. – wildplasser Dec 09 '12 at 23:12
On second thought: he is correct (of course) *OMG my DS9000 just launched nasal demons* ... – wildplasser Dec 10 '12 at 00:16
