I have a lexer specified with the following definitions:

ws      [ \t\n]+
punc            (\.|\,|\!|\?)
word        ({punc}|[a-zA-Z0-9])*
special         (\%|\_|\&|\$|\#)

I have some UTF-8 files that I need to parse, and naturally the lexer blows up when it reaches non-ASCII characters. I know that similar questions have been asked a few times in the past, but none of them helped. I tried to use the approach given in this answer, but I failed. I guess the problem is in the definition of word above?

It would be really helpful if someone could give details on the general concept of using UTF-8 encoding with flex.

osolmaz

1 Answer

Try this (process it with flex -8):

/* definitions go before the first %%; rules go between the two %% markers */
ws      [ \t\n]+
punc            (\.|\,|\!|\?)
word        ({punc}|[a-zA-Z0-9\x80-\xf3])*
special         (\%|\_|\&|\$|\#)

%%

%%

(The coding is a bit coarse-grained ...) The link mentioned by the OP, leading to Kaz's answer, is much more exact with respect to the allowed sequences.
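To see what the coarse class buys you, here is a minimal C sketch (not flex output, and the function names are made up for illustration). It mimics the two character classes byte by byte and counts the runs each one would glue together. It leaves {punc} out for brevity and assumes the input is a NUL-terminated byte string.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the coarse class [a-zA-Z0-9\x80-\xf3]: every byte in 0x80..0xf3
   counts as part of a word, whether or not it forms valid UTF-8. */
static int is_word_byte(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || (c >= 0x80 && c <= 0xf3);
}

/* Mirrors the ASCII-only class [a-zA-Z0-9] from the question. */
static int is_ascii_word_byte(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Counts maximal runs of consecutive bytes accepted by the predicate. */
static size_t count_runs(const char *s, int (*accept)(unsigned char))
{
    size_t runs = 0;
    int in_run = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        int ok = accept((unsigned char)*s);
        if (ok && !in_run)
            ++runs;          /* a new run starts at this byte */
        in_run = ok;
    }
    return runs;
}
```

For the UTF-8 bytes of "café naïve", count_runs with is_word_byte gives 2 runs, while is_ascii_word_byte gives 3, because the bytes of é and ï break the ASCII-only runs apart. The price is that the coarse class also accepts byte sequences that are not valid UTF-8 at all.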

wildplasser
  • I examined the output again and found that it actually produced UTF-8 output in the first place (I was misled because my terminal does not support UTF-8). However, Unicode characters were treated as separate words. Changing ``word``'s definition as you suggested solved the problem. BTW, is there a difference between xf3 and xf4 as the upper limit? xf4 is reserved for private use, and xf5+ are invalid, right? – osolmaz Dec 09 '12 at 22:54
  • My flex syntax was a little rusty, and I actually typed this in as a guess, but flex *does* appear to have full 8-bit support. Not all characters (and sequences) above 0x7f are valid UTF-8 sequences, so you might want to be more restrictive in what you accept. – wildplasser Dec 09 '12 at 22:59
  • Look at the answer I linked in the question. I think they accomplished that by trimming the invalid intervals? So it would be ``\x80-\xbf, \xc2-\xdf, \xe0-\xef, \xf0-\xf4``. You may edit your answer if you like. – osolmaz Dec 09 '12 at 23:06
  • Kaz has the right answer wrt characters that are possible. It still allows illegal *sequences*, such as 0xc0 plus *two or more* 0x8x bytes. – wildplasser Dec 09 '12 at 23:12
  • 1
    On second sight: he is correct (of course) *OMG my DS9000 just launched nasal deamons* ... – wildplasser Dec 10 '12 at 00:16
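For anyone who wants to test input against the ranges quoted in the comments, here is a rough C sketch of a strict check. It is not taken from the linked answer. The function names are hypothetical, and the extra limits on the second byte after 0xe0, 0xed, 0xf0 and 0xf4 are my addition, following the well-formed byte sequence table in the Unicode Standard.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length (1..4) of the well-formed UTF-8 sequence that starts
   at s, or 0 if none starts there. n is the number of bytes available. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned char lo = 0x80, hi = 0xbf;  /* allowed range of the second byte */

    assert(s != NULL);
    if (n == 0)
        return 0;

    if (s[0] <= 0x7f)
        return 1;                        /* plain ASCII */
    else if (s[0] >= 0xc2 && s[0] <= 0xdf)
        len = 2;
    else if (s[0] >= 0xe0 && s[0] <= 0xef)
        len = 3;
    else if (s[0] >= 0xf0 && s[0] <= 0xf4)
        len = 4;
    else
        return 0;   /* 0x80..0xc1 and 0xf5..0xff cannot start a sequence */

    if (s[0] == 0xe0)      lo = 0xa0;    /* rejects overlong 3-byte forms */
    else if (s[0] == 0xed) hi = 0x9f;    /* rejects UTF-16 surrogates */
    else if (s[0] == 0xf0) lo = 0x90;    /* rejects overlong 4-byte forms */
    else if (s[0] == 0xf4) hi = 0x8f;    /* rejects values above U+10FFFF */

    if (n < len)
        return 0;                        /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; ++i)
        if (s[i] < 0x80 || s[i] > 0xbf)
            return 0;                    /* not a continuation byte */
    return len;
}

/* Returns 1 if the n bytes at s are entirely well-formed UTF-8, else 0. */
static int utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0)
            return 0;
        i += len;
    }
    return 1;
}
```

With this, the bytes c3 a9 (é) pass, while c0 80 80, a lone 80, and a truncated e2 82 are rejected. The same ranges can be written as flex definitions, one alternative per sequence length.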