
Background

I am using the UTF8-CPP library. The vast majority of my strings use the ASCII character set (0-127). The problem with UTF-8-based strings is that the index function (i.e. retrieving the character at a specific position) is slow: code points have variable width, so finding the i-th one means scanning from the start of the string.

Idea

A simple technique is to keep a flag as a property (isAscii) which says whether the string is pure ASCII or not. This flag would be updated whenever the string is modified.

This solution seems too simple, and there may be things I am overlooking. But if it is viable, does it not provide the best of both worlds (i.e. Unicode when needed, and performance for the vast majority of cases), and would it not guarantee O(1) index lookups?
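To make the idea concrete, here is a rough sketch of the kind of wrapper I have in mind. The class and its members are hypothetical (not UTF8-CPP API); only `utf8::advance` and `utf8::next` come from the library:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <utf8.h> // UTF8-CPP; utf8::advance and utf8::next are library calls

// Hypothetical wrapper -- an illustration of the idea, not UTF8-CPP API.
class FlaggedString {
public:
    explicit FlaggedString(std::string bytes)
        : data_(std::move(bytes)),
          isAscii_(std::all_of(data_.begin(), data_.end(),
                               [](unsigned char c) { return c < 0x80; })) {}

    // Code point at index i: O(1) on the ASCII fast path,
    // an O(n) scan on the general UTF-8 path.
    char32_t at(std::size_t i) const {
        if (isAscii_)
            return static_cast<unsigned char>(data_[i]);
        std::string::const_iterator it = data_.begin();
        utf8::advance(it, i, data_.end());  // skip the first i code points
        return utf8::next(it, data_.end()); // decode the i-th one
    }

private:
    std::string data_;
    bool isAscii_; // must be kept correct on every mutation
};
```

The open question is the mutation side: keeping isAscii correct when the string is modified, which is what the comments below dig into.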

UPDATE

I'm going to attach a diagram to clarify what I mean. I think a lot of people are misunderstanding what I mean (or I am misunderstanding basic concepts). All good replies.

MustafaM
  • what utf-8 string class? Have you looked at icu4c? – bmargulies Nov 05 '11 at 13:08
  • Most people use UTF-8 only for serializing text, not for in-memory manipulation. `wchar_t` is more common as a fixed-width Unicode char type which supports almost every reasonable Unicode code point. – tenfour Nov 05 '11 at 13:16
  • If all you use is a single flag, then updating a single character increases in complexity to O(n): Consider a string with isAscii=false. All you know is that there exists at least one non-ASCII character in it. You then update a single character from 0xc0 to 0x40. Is the result an ASCII string? You have to rescan the whole string to find out. (You would have to make isAscii a count of non-ASCII characters, and every string update would have to maintain the count; see the sketch after these comments.) – Raymond Chen Nov 05 '11 at 13:17
  • @tenfour you seem to be assuming Windows. Common practice elsewhere is to do UTF-8 in memory, however clumsy. – bmargulies Nov 05 '11 at 13:30
  • Wrong kind of optimization, I'd say. You'll basically write a program that only works well for English speaking users, sucks mud everywhere else. Something you'll be happy to ignore. Ignoring perf problems with 4 billion potential users is never a very good idea. – Hans Passant Nov 05 '11 at 14:28
  • @tenfour: `wchar_t` has nothing to do with Unicode. At all. It's for "the system's character encoding". There are systems on which that encoding isn't useful. [See my rant on the subject.](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) A better answer would be to keep the strings internally as UTF-32, perhaps in the sexy new `char32_t` type or `std::u32string` class. – Kerrek SB Nov 05 '11 at 14:47
  • `wchar_t` is not an encoding, it's a datatype. Agree that UTF-32 is the bestest encoding for in-memory storage though. – tenfour Nov 05 '11 at 14:58
  • @Raymond Chen: True. What I actually meant was an isNonAscii flag. Then it would be O(1). – MustafaM Nov 05 '11 at 20:53
  • @tenfour: Yes, but I was specifically talking about the UTF8-CPP which uses in-memory UTF8. Other strings, such as QString and NSString use internal UTF-16 encodings. – MustafaM Nov 05 '11 at 20:55
  • IsNonAscii would still be O(n). When a non-ASCII character turns into an ASCII one, you need to rescan the string to see if you need to set isNonAscii to false. – Raymond Chen Nov 05 '11 at 21:25
  • The point would be moot for languages with immutable strings of course. – bobince Nov 06 '11 at 10:39
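To make Raymond Chen's counter suggestion concrete, here is a minimal sketch (the class and its names are hypothetical). Counting non-ASCII *bytes* rather than characters keeps every update O(1):

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical class illustrating the comment above: keep a count of
// non-ASCII bytes instead of a single flag, so updates stay O(1).
class CountedString {
public:
    explicit CountedString(std::string bytes) : data_(std::move(bytes)) {
        for (char c : data_)
            if (static_cast<unsigned char>(c) >= 0x80)
                ++nonAsciiBytes_;
    }

    bool isAscii() const { return nonAsciiBytes_ == 0; } // O(1), always exact

    // Overwrite one byte and adjust the count in O(1) -- no rescans.
    void setByte(std::size_t i, unsigned char b) {
        if (static_cast<unsigned char>(data_[i]) >= 0x80) --nonAsciiBytes_;
        if (b >= 0x80) ++nonAsciiBytes_;
        data_[i] = static_cast<char>(b);
    }

private:
    std::string data_;
    std::size_t nonAsciiBytes_ = 0;
};
```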

3 Answers


I think the point here is that while the vast majority of your strings are ASCII, the designer of a UTF-8 library has to expect arbitrary UTF-8 strings in general, and for those, checking and setting this flag is unnecessary overhead.

In your case, it might be worth the effort to wrap or modify the UTF8 class accordingly. But before you do that, ask your favorite profiler if it's worth it.

thiton

"It depends" on your needs for thread safety and updates, and the length of your strings, and how many you've got. In other words, only profiling your idea in your real application will tell you if it makes things better or worse.

bmargulies

If you want to speed up the UTF8 case...

First, consider indexing code points sequentially, so that you avoid counting them from the very beginning of the string again and again: implement and use routines that step to the next and the previous code point.
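For example, with UTF8-CPP's `utf8::next` and `utf8::prior` a caller can carry an iterator between lookups instead of re-counting from the start each time; a small sketch:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <utf8.h> // UTF8-CPP

int main() {
    std::string s = "a\xC3\xA9z"; // "a", U+00E9 (e-acute), "z" in UTF-8

    // Walk forward once, keeping the iterator between steps instead of
    // re-counting code points from s.begin() for every index.
    std::string::iterator it = s.begin();
    while (it != s.end()) {
        std::uint32_t cp = utf8::next(it, s.end()); // decode, then advance
        std::cout << std::hex << cp << '\n';        // prints 61, e9, 7a
    }

    // Stepping backward works the same way:
    std::uint32_t last = utf8::prior(it, s.begin()); // back to 'z' (0x7a)
    std::cout << std::hex << last << '\n';
}
```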

Second, you may build an array of indices into the UTF-8 string's code points and use it as the first step of a search; it will give you the approximate location of the sought code point.
The array can either be of fixed size, in which case search time stays ~O(n) at an O(1) memory cost, or it can hold equally spaced entries (the byte offset of every m-th code point, where m is some constant), in which case a lookup jumps to the nearest checkpoint and scans at most m code points, for ~O(m) search time at an O(n/m) memory cost.
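A sketch of the equally spaced variant (the helper names are mine; only `utf8::next` and `utf8::advance` are library calls):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>
#include <utf8.h> // UTF8-CPP

// Record the byte offset of every m-th code point:
// offsets[k] is where code point k*m starts.
std::vector<std::size_t> buildCheckpoints(const std::string& s, std::size_t m) {
    std::vector<std::size_t> offsets;
    std::string::const_iterator it = s.begin();
    for (std::size_t cp = 0; it != s.end(); ++cp) {
        if (cp % m == 0)
            offsets.push_back(static_cast<std::size_t>(it - s.begin()));
        utf8::next(it, s.end()); // skip one code point
    }
    return offsets;
}

// i-th code point (assumes 0 <= i < number of code points): jump to the
// nearest checkpoint, then scan at most m-1 code points -- O(m), not O(n).
std::uint32_t codePointAt(const std::string& s,
                          const std::vector<std::size_t>& offsets,
                          std::size_t m, std::size_t i) {
    std::string::const_iterator it = s.begin() + offsets[i / m];
    utf8::advance(it, i % m, s.end());
    return utf8::next(it, s.end());
}
```

The table has to be rebuilt (or patched) whenever the string is modified, which is the usual trade-off for this kind of acceleration structure.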

You could also embed the indices inside the string itself, encoding them as reserved/unused code points, or as deliberately invalid sequences (say, a first byte of binary 11111110 followed by, for example, six 10xxxxxx bytes carrying the index, or whatever you like).

Alexey Frunze