
I've written a lexer in C. It currently lexes files in ASCII successfully, but I'm confused as to how I would lex Unicode. Which Unicode encodings would I need to lex? For instance, should I support UTF-8, UTF-16, etc.? What do languages like Rust or Go support?

Are there any libraries that can help me out? I would prefer to try to do it myself so I can learn, though even then, a small library that I could read to learn from would be great.

metro-man
  • I had this problem a long time ago. Have no code for you - but try this: http://www.w3.org/2005/03/23-lex-U – jim mcnamara Apr 15 '15 at 20:03
  • So will this help me out even if I'm not using this Flex/Lex thing? – metro-man Apr 15 '15 at 20:06
  • I read it a while back and found it helpful. I did not get that you were rewriting lex completely. My bad. – jim mcnamara Apr 15 '15 at 20:07
  • Re-writing Lex completely? Sorry I'm kind of confused, do you mean writing a lexer by hand? – metro-man Apr 15 '15 at 20:10
  • It is me that is confused. If you completely wrote a functioning lexer, then I would guess you are trying to rewrite lex for flex or whatever. Try getting the PCRE source code - it supports UTF8 - so you can learn how Perl supports UTF8. My assumption: you are using routines. – jim mcnamara Apr 15 '15 at 20:15
  • @jimmcnamara Ahh, I'm afraid I'm not using any regular expressions :( – metro-man Apr 15 '15 at 20:17
  • Then, whoa, writing a lexer is much harder. For support for doing string operations on UTF text, use the fontconfig library; the source is here: http://www.freedesktop.org/software/fontconfig/release/ – jim mcnamara Apr 15 '15 at 20:28
  • Aww, that's no fun. I'm looking around on GitHub, and found [this](https://github.com/haipome/utf8). Looks a lot smaller and simpler to use, I could try and re-write something like this? – metro-man Apr 15 '15 at 20:34
  • Although writing a lexer from scratch *for a specific language* is often harder than generating one via `lex` or a similar tool, it's not necessarily all that hard. Certainly not approaching as hard as writing a lexer *generator* such as `lex` itself. – John Bollinger Apr 15 '15 at 20:51
  • As for which encodings you should support, that's a question of what you want to be able to parse. If you already support ASCII, then extending your existing code to cover UTF-8 will probably not be too hard. It would be worth considering handling other encodings by first converting to UTF-8 (perhaps on the fly) and then parsing the UTF-8. – John Bollinger Apr 15 '15 at 20:55
  • Note that questions requesting recommendations for third-party libraries are out of scope on SO. Nevertheless, it might be worth your while to check out [ICU](http://site.icu-project.org/). – John Bollinger Apr 15 '15 at 20:57
  • @JohnBollinger Hmm, the lexer already lexes my entire language, and it works by consuming each character in a character stream and recognizing tokens (so your typical lexer, just a huge state machine). Do you think the GitHub repository I linked in the comment above is a useful resource? Not to use directly, but to attempt to re-write or adapt into my lexer? – metro-man Apr 15 '15 at 21:09
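
Picking up John Bollinger's suggestion above of converting other encodings to UTF-8 before lexing: a common first step is to sniff the byte-order mark (BOM) at the start of the file to guess which encoding you are dealing with. A minimal sketch in C (the function name guess_encoding is made up for illustration, and note that most real-world UTF-8 files carry no BOM at all):

#include <stdio.h>

/* Guess the encoding of a stream from its byte-order mark, if any.
   Minimal sketch: BOM-less files are simply assumed to be UTF-8,
   since ASCII is a strict subset of UTF-8. */
static const char *guess_encoding(FILE *in)
{
    unsigned char bom[3] = {0, 0, 0};
    size_t n = fread(bom, 1, sizeof bom, in);
    rewind(in);              /* hand the bytes back to the lexer */

    if (n >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return "UTF-8";      /* UTF-8 BOM (optional, but some editors emit it) */
    if (n >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
        return "UTF-16BE";
    return "UTF-8";          /* no BOM: assume UTF-8/ASCII */
}

Anything identified as UTF-16 could then be converted to UTF-8 (e.g. with iconv) before it reaches the lexer, so the lexer itself only ever has to understand one encoding.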

1 Answer


There are already versions of lex (and other lexer tools) that support Unicode; they are tabulated on the Wikipedia page List of Lexer Generators. There is also a list of lexer tools on the Wikipedia parser page. In summary, the following tools handle Unicode:

  • JavaCC - JavaCC generates lexical analyzers written in Java.
  • JFLex - A lexical analyzer generator for Java.
  • Quex - A fast universal lexical analyzer generator for C and C++.
  • FsLex - A lexer generator for byte and Unicode character input for F#.

And, of course, there are the techniques used by W3.org and cited by @jim mcnamara at http://www.w3.org/2005/03/23-lex-U.

You say you have written your own lexer in C, but you have tagged the question with lex, which refers to the tool called lex; perhaps that was an oversight?

In the comments you say you have not used regular expressions, but that you want to learn. Learning something about the theory of language recognition is key to writing an efficient and correct lexer. The symbols being recognised form a Chomsky Type 3 language, or regular language, which can be described by regular expressions. Regular expressions, in turn, can be implemented by code that realises a finite state automaton (or finite state machine). The standard implementation of a finite state machine is a loop containing a switch. Most experienced coders should know, and be able to recognise and exploit, this form:

while ( not <<EOF>> ) {           /* consume the input one symbol at a time */
    switch ( input_symbol ) {     /* dispatch on the current symbol/state   */
        case ( state_symbol[0] ):
            ...                   /* handle one symbol class, update state  */
            break;
        case ( state_symbol[1] ):
            ...
            break;
        default:                  /* anything not matched above             */
            ...
            break;
    }
}

If you code in this style, the same code can work whether the symbols being handled are 8-bit or 16-bit, because the algorithmic pattern remains the same.
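
To make that concrete for UTF-8: a small decoding function can sit between the input stream and the switch, so the state machine consumes whole code points rather than raw bytes. A minimal sketch (the name next_codepoint is made up for illustration; it assumes well-formed UTF-8 and does no validation or error recovery):

#include <stdio.h>
#include <stdint.h>

/* Read one UTF-8 sequence from a stream and return its code point,
   or -1 at end of input.  Minimal sketch: assumes well-formed UTF-8. */
static int32_t next_codepoint(FILE *in)
{
    int c = fgetc(in);
    if (c == EOF) return -1;
    if (c < 0x80) return c;               /* single byte: plain ASCII */

    int32_t cp;
    int extra;                            /* number of continuation bytes */
    if      ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
    else                         { cp = c & 0x07; extra = 3; }

    while (extra-- > 0) {
        c = fgetc(in);                    /* continuation byte: 10xxxxxx */
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}

The loop then becomes while ((symbol = next_codepoint(in)) != -1) { switch ... }, and since the token-introducing characters of a typical language (letters, digits, operators, whitespace) are all ASCII, the existing cases continue to work unchanged; only rules such as "what may appear in an identifier" need to decide which of the new code points they accept.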

Ad-hoc coding of a lexical analyser without an understanding of the underlying theory and practice will eventually hit its limits. I think you will find it beneficial to read a little more into this area.

Brian Tompsett - 汤莱恩