I have a lexer specified with the following definitions:
ws [ \t\n]+
punc (\.|\,|\!|\?)
word ({punc}|[a-zA-Z0-9])*
special (\%|\_|\&|\$|\#)
I have some UTF-8 files that I need to parse, and naturally the lexer blows up when it reaches non-ASCII characters. I know that similar questions have been asked a few times in the past, but none of the answers helped. I tried the approach given in this answer, but could not get it to work. I guess the problem is in the definition of word above?
It would be really helpful if someone could explain the general approach to handling UTF-8 encoded input with flex.