
Background

I am using the UTF8-CPP library. The vast majority of my strings use the ASCII character set (0-127). The problem with UTF-8-based strings is that the index function (i.e. retrieving the character at a specific position) is slow: code points have variable width, so finding the i-th one means scanning from the start of the string.

Idea

A simple technique is to keep a flag as a property (isAscii) which says whether the string is pure ASCII or not. This flag would be updated whenever the string is modified.

This solution seems too simple, and there may be things I am overlooking. But if it is viable, does it not provide the best of both worlds (i.e. Unicode when needed, and performance for the vast majority of cases), and would it not guarantee O(1) index lookups?
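To make the idea concrete, here is a rough sketch of the kind of wrapper I have in mind. The class and its members are hypothetical (not UTF8-CPP API); only `utf8::advance` and `utf8::next` come from the library:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <utf8.h> // UTF8-CPP; utf8::advance and utf8::next are library calls

// Hypothetical wrapper -- an illustration of the idea, not UTF8-CPP API.
class FlaggedString {
public:
    explicit FlaggedString(std::string bytes)
        : data_(std::move(bytes)),
          isAscii_(std::all_of(data_.begin(), data_.end(),
                               [](unsigned char c) { return c < 0x80; })) {}

    // Code point at index i: O(1) on the ASCII fast path,
    // an O(n) scan on the general UTF-8 path.
    char32_t at(std::size_t i) const {
        if (isAscii_)
            return static_cast<unsigned char>(data_[i]);
        std::string::const_iterator it = data_.begin();
        utf8::advance(it, i, data_.end());  // skip the first i code points
        return utf8::next(it, data_.end()); // decode the i-th one
    }

private:
    std::string data_;
    bool isAscii_; // must be kept correct on every mutation
};
```

The open question is the mutation side: keeping isAscii correct when the string is modified, which is what the comments below dig into.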

UPDATE

I'm going to attach a diagram to clarify what I mean. I think a lot of people are misunderstanding what I mean (or I am misunderstanding basic concepts). All good replies.

MustafaM
  • what utf-8 string class? Have you looked at icu4c? – bmargulies Nov 05 '11 at 13:08
  • Most people use UTF-8 only for serializing text, not for in-memory manipulation. `wchar_t` is more common as a fixed-width Unicode char type which supports almost every reasonable Unicode code point. – tenfour Nov 05 '11 at 13:16
  • If all you use is a single flag, then updating a single character increases in complexity to O(n): Consider a string with isAscii=false. All you know is that there exists at least one non-ASCII character in it. You then update a single character from 0xc0 to 0x40. Is the result an ASCII string? You have to rescan the whole string to find out. (You would have to make isAscii a count of non-ASCII characters, and every string update would have to maintain the count; see the sketch after these comments.) – Raymond Chen Nov 05 '11 at 13:17
  • @tenfour you seem to be assuming Windows. Common practice elsewhere is to do UTF-8 in memory, however clumsy. – bmargulies Nov 05 '11 at 13:30
  • Wrong kind of optimization, I'd say. You'll basically write a program that only works well for English speaking users, sucks mud everywhere else. Something you'll be happy to ignore. Ignoring perf problems with 4 billion potential users is never a very good idea. – Hans Passant Nov 05 '11 at 14:28
  • @tenfour: `wchar_t` has nothing to do with Unicode. At all. It's for "the system's character encoding". There are systems on which that encoding isn't useful. [See my rant on the subject.](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) A better answer would be to keep the strings internally as UTF-32, perhaps in the sexy new `char32_t` type or `std::u32string` class. – Kerrek SB Nov 05 '11 at 14:47
  • `wchar_t` is not an encoding, it's a datatype. Agree that UTF-32 is the bestest encoding for in-memory storage though. – tenfour Nov 05 '11 at 14:58
  • @Raymond Chen: True. What I actually meant was an isNonAscii flag. Then it would be O(1). – MustafaM Nov 05 '11 at 20:53
  • @tenfour: Yes, but I was specifically talking about the UTF8-CPP which uses in-memory UTF8. Other strings, such as QString and NSString use internal UTF-16 encodings. – MustafaM Nov 05 '11 at 20:55
  • IsNonAscii would still be O(n). When a non-ASCII character turns into an ASCII one, you need to rescan the string to see if you need to set isNonAscii to false. – Raymond Chen Nov 05 '11 at 21:25
  • The point would be moot for languages with immutable strings of course. – bobince Nov 06 '11 at 10:39
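To make Raymond Chen's counter suggestion concrete, here is a minimal sketch (the class and its names are hypothetical). Counting non-ASCII *bytes* rather than characters keeps every update O(1):

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical class illustrating the comment above: keep a count of
// non-ASCII bytes instead of a single flag, so updates stay O(1).
class CountedString {
public:
    explicit CountedString(std::string bytes) : data_(std::move(bytes)) {
        for (char c : data_)
            if (static_cast<unsigned char>(c) >= 0x80)
                ++nonAsciiBytes_;
    }

    bool isAscii() const { return nonAsciiBytes_ == 0; } // O(1), always exact

    // Overwrite one byte and adjust the count in O(1) -- no rescans.
    void setByte(std::size_t i, unsigned char b) {
        if (static_cast<unsigned char>(data_[i]) >= 0x80) --nonAsciiBytes_;
        if (b >= 0x80) ++nonAsciiBytes_;
        data_[i] = static_cast<char>(b);
    }

private:
    std::string data_;
    std::size_t nonAsciiBytes_ = 0;
};
```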

3 Answers


I think the point here is that while the vast majority of your strings are ASCII, the designer of a UTF-8 library has to expect arbitrary UTF-8 strings in general, and for those, checking and setting this flag is unnecessary overhead.

In your case, it might be worth the effort to wrap or modify the UTF8 class accordingly. But before you do that, ask your favorite profiler if it's worth it.

thiton

"It depends" on your needs for thread safety and updates, and the length of your strings, and how many you've got. In other words, only profiling your idea in your real application will tell you if it makes things better or worse.

bmargulies

If you want to speed up the UTF8 case...

First, consider indexing code points sequentially, so that you avoid counting them from the very beginning of the string again and again: implement and use routines that step to the next and the previous code point.
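For example, with UTF8-CPP's `utf8::next` and `utf8::prior` a caller can carry an iterator between lookups instead of re-counting from the start each time; a small sketch:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <utf8.h> // UTF8-CPP

int main() {
    std::string s = "a\xC3\xA9z"; // "a", U+00E9 (e-acute), "z" in UTF-8

    // Walk forward once, keeping the iterator between steps instead of
    // re-counting code points from s.begin() for every index.
    std::string::iterator it = s.begin();
    while (it != s.end()) {
        std::uint32_t cp = utf8::next(it, s.end()); // decode, then advance
        std::cout << std::hex << cp << '\n';        // prints 61, e9, 7a
    }

    // Stepping backward works the same way:
    std::uint32_t last = utf8::prior(it, s.begin()); // back to 'z' (0x7a)
    std::cout << std::hex << last << '\n';
}
```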

Second, you may build an array of indices into the UTF-8 string's code points and use it as the first step of a search; it will give you the approximate location of the sought code point.
The array can either be of fixed size, in which case search time stays ~O(n) at an O(1) memory cost, or it can hold equally spaced entries (the byte offset of every m-th code point, where m is some constant), in which case a lookup jumps to the nearest checkpoint and scans at most m code points, for ~O(m) search time at an O(n/m) memory cost.
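A sketch of the equally spaced variant (the helper names are mine; only `utf8::next` and `utf8::advance` are library calls):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>
#include <utf8.h> // UTF8-CPP

// Record the byte offset of every m-th code point:
// offsets[k] is where code point k*m starts.
std::vector<std::size_t> buildCheckpoints(const std::string& s, std::size_t m) {
    std::vector<std::size_t> offsets;
    std::string::const_iterator it = s.begin();
    for (std::size_t cp = 0; it != s.end(); ++cp) {
        if (cp % m == 0)
            offsets.push_back(static_cast<std::size_t>(it - s.begin()));
        utf8::next(it, s.end()); // skip one code point
    }
    return offsets;
}

// i-th code point (assumes 0 <= i < number of code points): jump to the
// nearest checkpoint, then scan at most m-1 code points -- O(m), not O(n).
std::uint32_t codePointAt(const std::string& s,
                          const std::vector<std::size_t>& offsets,
                          std::size_t m, std::size_t i) {
    std::string::const_iterator it = s.begin() + offsets[i / m];
    utf8::advance(it, i % m, s.end());
    return utf8::next(it, s.end());
}
```

The table has to be rebuilt (or patched) whenever the string is modified, which is the usual trade-off for this kind of acceleration structure.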

You could also embed the indices inside the string itself, encoding them as reserved/unused code points, or as deliberately invalid sequences (say, a first byte of binary 11111110 followed by, for example, six 10xxxxxx bytes carrying the index, or whatever you like).

Alexey Frunze