How do I pad a Unicode string to a specific visible length?

Question

I'd like to create a left pad function in a programming language. The function pads a string with leading characters to a specified total length. Strings are UTF-16 encoded in this language.

There are a few things in Unicode that make it complicated:

Surrogates: 2 surrogate code units = 1 Unicode character
Combining characters: 1 non-combining character + any number of combining characters = 1 visible character
Invisible characters: 1 invisible character = 0 visible characters

What other factors have to be taken into consideration, and how would they be dealt with?
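
As a rough illustration of how the three rules listed above could be applied, here is a minimal Python sketch. It is not a definitive implementation: `visible_length` and `left_pad` are hypothetical names, and treating the general categories Mn/Mc/Me as "combining" and Cf/Cc as "invisible" is an assumption. Python strings are sequences of code points rather than UTF-16 code units, so a surrogate pair already counts as one character here; in a UTF-16 language that step has to be done explicitly. The sketch does not attempt full grapheme cluster segmentation (UAX #29) or handle double-width East Asian characters.

```python
import unicodedata

# Assumption: "combining" means any mark category (Mn, Mc, Me).
COMBINING_CATEGORIES = {"Mn", "Mc", "Me"}
# Assumption: "invisible" means format (Cf) or control (Cc) characters,
# e.g. U+200B ZERO WIDTH SPACE is Cf.
INVISIBLE_CATEGORIES = {"Cf", "Cc"}


def visible_length(text):
    """Approximate the number of visible characters in text."""
    count = 0
    for ch in text:  # iterates code points, so surrogate pairs count once
        category = unicodedata.category(ch)
        if category in COMBINING_CATEGORIES or category in INVISIBLE_CATEGORIES:
            continue  # contributes nothing to the visible length
        count += 1
    return count


def left_pad(text, total, fill=" "):
    """Prepend fill until text has at least `total` visible characters."""
    if visible_length(fill) != 1:
        raise ValueError("fill must be exactly one visible character")
    missing = total - visible_length(text)
    return fill * max(missing, 0) + text
```

For example, `left_pad("e\u0301", 3, "*")` prepends two asterisks, because the letter plus its combining acute accent counts as a single visible character.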

-1 for the rant, it's very unconstructive. Characters and writing systems *are* complicated. Trying to deal with all the world's writing systems *without* Unicode would be even more of a headache. And even ASCII has invisible characters, which give headaches to plenty of programmers who *don't understand them*. — deceze, Aug 02 '13 at 18:03
I think Unicode is only praised because there isn't a better standard. — simplesoft55, Aug 02 '13 at 18:12
Maybe so, but unless you propose a better standard which deals with all the world's writing systems and all their little problems and actually come up with something that is better than Unicode: just deal with it. — deceze, Aug 02 '13 at 18:13
This is not a programming question, even though it vaguely mentions creating a function in "a programming language". This is a personal rant, and is inappropriate here. If you want to write complaints, please do so on your own blog or send an email to the appropriate standards committee. SO has guidelines for posting here; you can review those guidelines by reading the [help] section on "What kinds of questions should I ask here?" (and also the following section on what **not** to ask here). Good luck writing your own standard (and getting it globally accepted and put into use). — Ken White, Aug 02 '13 at 19:52
I've edited your question to be more constructive — in the future, however, as much as the technology may displease you, please refrain from devolving into a rant; we're here to answer questions, not read rants. — icktoofay, Aug 03 '13 at 04:09

score 1 · Answer 1 · answered Aug 02 '13 at 20:00

When you’re first starting out trying to understand something, it’s really frustrating. We’ve all been there. But while it’s very easy to call it stupid and everyone who made it stupid, you’re not going to get very far doing that. With an attitude like that, you’re implying that people who do understand it are also stupid for wasting their time on something so obviously stupid. After calling the people who do understand it stupid, it’s extremely unlikely that anyone who does understand it will take the time to explain it to you.

I understand the frustration. Unicode’s really complicated and it was a huge pain for me before I understood it and it’s still a pain for a lot of things I don’t have experience with. But the reason it’s so complicated isn’t because the people who made it were stupid and trying to ruin your life. It’s complicated because it attempts to provide a standard way of representing every human writing system ever used. Writing systems are insanely complicated, and throughout history developing a new and different writing system has been a fairly standard part of identifying yourself as a different culture from the people across the river or over the next mountain range. You yourself start off by identifying yourself as Hungarian based on the language you speak. Having once tried to pronounce a Hungarian professor’s name, I know that Hungarian is very complicated compared to English, just as English is very complicated compared to Hungarian. How would you feel if I was having trouble with Hungarian and asked you, “Boy, Hungarian sure is a stupid language! It must have been designed by idiots! By the way, how do I pronounce this word??”

There’s just no simple way to express something that’s inherently complicated. Human writing systems are inherently complicated and intentionally different from each other. As complicated as Unicode is, it’s better than what people had to do before, when instead of one single complicated standard there were multiple complicated standards in every country and you’d have to understand all of the different ‘standards.’

I’m not sure what your general life strategy is, but what I usually do when I don’t understand something is to pick up a few textbooks on the topic, read the textbooks through, and work out the examples. A good textbook will not only tell you how things are and what you need to do, but also how they got to be that way and why you need to do what you need to do.

I found Unicode Demystified to be an excellent book, and the newer book Unicode Explained has even higher ratings on Amazon.

andrewdotn

I like natural languages because they are nice and unique. But they weren't created. They were evolved. That is a great difference. The problem with Unicode is that it was created so it could be simple. But it isn't. A few problems: – simplesoft55 Aug 03 '13 at 08:36
Sorry the end of my last comment is missing. – simplesoft55 Aug 03 '13 at 08:38
Sorry, again. :) So a few problems with Unicode: I understand that writing systems are complicated, but do we really need combining characters, for example? OK, maybe very, very rarely we need a Latin letter + acute combination that isn't used in any natural language. But I think this is so rare that Unicode shouldn't support it. There are other ways to deal with it. Both users' and programmers' lives would be much easier. And the Unicode standard is created for them. And standards should make our lives easier, right? – simplesoft55 Aug 03 '13 at 08:54
Another problem: I'm wondering how many characters will be displayed in an Edit component on a Form if I set its Text property to a string that contains an invisible character plus 25 combining characters. And I'm also wondering how the user can edit this text. And what will be in the string if he edits the visible text? The number of problems is infinite... – simplesoft55 Aug 03 '13 at 09:02
And I think there isn't any program which perfectly supports Unicode. – simplesoft55 Aug 03 '13 at 09:08
And I think if no-one supports a standard perfectly then something is wrong with that standard. – simplesoft55 Aug 03 '13 at 09:24
And one more thing. I don't think that anyone who understands Unicode is stupid. I didn't want to hurt anyone here. After all, I also have to understand Unicode because I have no other choice. But I still think it is bad. But anyone who thinks otherwise: please, help me to solve the problems I mentioned satisfactorily! – simplesoft55 Aug 03 '13 at 15:25
OK, once again. I asked a few questions but nobody answered them. I was only criticized because I criticized the Unicode standard. But if I'm wrong and Unicode is perfect and the only problem is me and my attitude, then why could nobody answer any of my questions? – simplesoft55 Aug 07 '13 at 20:03
@simplesoft55 You’re right, the only problem is you and your attitude. I answered your question: before anyone will be able to answer any of your Unicode questions, you will need to read one of the linked textbooks on Unicode so that you understand the basics of Unicode, and you will need to drop the attitude. Your attitude scares away people who might help you. Suppose I spend a bunch of time explaining the padding of Unicode strings. Based on your past behaviour, the most likely outcome is that you will complain that the answer is too complicated and therefore I am stupid. Why bother? – andrewdotn Aug 08 '13 at 05:00
@simplesoft55 If you would like help with your attitude, here is a book that can help you a lot, for $10 shipped right to your door in Hungary: [How to Win Friends and Influence People](http://bit.ly/htwfaip-abebooks). If you don’t have $10, email me your address and I will order it for you. – andrewdotn Aug 08 '13 at 05:00
andrewdotn: Thank you for your help! I will buy that book but I'm sure it won't solve all of my problems. If I type the following text to different programs: an "a" letter + a zero width space (U+200B) + a combining diaeresis (U+0308) then it will give very different results. This becomes more complicated when I add some other combining diaeresis characters to the end of the string. And I bet these special cases aren't explained in any book. Am I wrong? If not then please, don't say again, that Unicode is good, because it is a standard and it should be supported in the same way by everyone! – simplesoft55 Aug 08 '13 at 09:34
@simplesoft55 Putting a combining character on a zero-width space is covered by [Section 2.11 of the standard](http://stackoverflow.com/q/14438785): “All combining characters can be applied to any base character and can, in principle, be used with any script … all sequences of character codes are permitted. This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.” – andrewdotn Aug 10 '13 at 04:19
@simplesoft55 That is, there is a general principle of Unicode: While Unicode has a lot of stuff that you can in theory put together in lots of creative ways—all that is supported in practice is what’s used by actual human writing systems. The standard says that while programs have to not crash when there’s a combining character on a zero-width space, programs are free to do whatever they want in terms of rendering. Defining behaviour on this edge case, when nobody needs this to communicate their language, is much less important than handling all the writing systems that aren’t yet in Unicode. – andrewdotn Aug 10 '13 at 04:26
andrewdotn: This means that Unicode is a standard that can be supported in different ways. Then why do we call it a standard? :) I think the only thing I can do is create an imperfect Unicode Pad function with lots of extra work. :( Thanks for your help and patience! – simplesoft55 Aug 10 '13 at 22:57
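
To make the edge case from the comments above concrete, the following Python sketch inspects the sequence "a" + U+200B + U+0308 using the standard `unicodedata` module. The helper name `describe` is hypothetical. It only shows what the code points are and how normalization treats them; it says nothing about how any particular program will render the sequence, which the standard leaves to implementations.

```python
import unicodedata

# "a", ZERO WIDTH SPACE, COMBINING DIAERESIS, as described in the comment.
sample = "a\u200b\u0308"


def describe(text):
    """Return (code point, general category, name) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for row in describe(sample):
    print(row)
# ('U+0061', 'Ll', 'LATIN SMALL LETTER A')
# ('U+200B', 'Cf', 'ZERO WIDTH SPACE')
# ('U+0308', 'Mn', 'COMBINING DIAERESIS')

# All three code points are in the Basic Multilingual Plane, so the
# UTF-16 form uses one code unit each.
utf16_units = len(sample.encode("utf-16-le")) // 2  # 3

# Without the zero-width space, NFC composes the pair into U+00E4.
composed = unicodedata.normalize("NFC", "a\u0308")  # "\u00e4"

# With it, the diaeresis follows the zero-width space, not the "a",
# so NFC leaves the sequence unchanged.
unchanged = unicodedata.normalize("NFC", sample) == sample  # True
```

The diaeresis formally applies to the zero-width space, a format character with nothing to draw, which is one reason different programs show different results for this input.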
