Character Encoding independent character swap

Question

I like to use this piece of code when I want to reverse a string (when I am not using std::string or other built-in functions in C). As a beginner, when I first thought of this, I had the ASCII table in mind. I think it can work well with Unicode too. I assumed that it works because the difference between character values (ASCII, etc.) is fixed.

Are there any character encodings in which this code may not work?

char a[11],t;
int len,i;
strcpy(a,"Particl");    
printf("%s\n",a);
len = strlen(a);
for(i=0;i<(len/2);i++)
{
    a[i] += a[len-1-i];
    a[len-1-i] = a[i] - a[len-1-i];
    a[i] -= a[len-1-i];
}
printf("%s\n",a);

Update:

This link is informative in association with this question.

You can have overflow if `char` is signed, and that would be undefined behaviour. Just use a temporary for the swap. The code will also produce an invalid result if, for example, you have a UTF-8 encoded string with multi-byte code points in it. – Daniel Fischer, May 14 '13 at 15:31
This code does not swap characters correctly in any encoding where a character may occupy more than one `char`, which is a byte (and practically always an octet). Note that this depends a bit on what you call a "character", but it goes wrong in some way or another for practically every sane notion of "character". In other words, it does not work in any character encoding you should be using. – delnan, May 14 '13 at 15:33
Trying to sort out whether that code inside the loop actually swaps values gives me a headache. Just write it in the obvious way, with a temporary variable, so that future maintainers (including yourself) won't have to puzzle over it. – Pete Becker, May 14 '13 at 21:33
@Pete Yes, actually I wanted to try something without using a temporary. So I thought `a = a + b; b = a - b; a = a - b` would work out. From @delnan's comments I gather that this has more to do with the storage size of the data type than with the encoding. – Suvarna Pattayil, May 15 '13 at 07:17
@delnan I did not understand why you said `it does not work in any character encoding you should be using`. I want to know, as I am new to character encoding and related topics. Did you mean it won't work in ALL possible encodings (it might work in some and not in others), or that it won't work in ANY existing encoding? – Suvarna Pattayil, May 15 '13 at 13:06
It only works in "single-byte" encodings, e.g. ISO 8859-1, KOI8-R, etc., in which one `char` (that is, one byte) is sufficient to represent all codepoints. – zwol, May 15 '13 at 15:33
@SuvP In that part specifically, I am saying the character encodings for which your code works are character encodings that should not be used. There are encodings for which your code works, but they are not desirable or useful. – delnan, May 15 '13 at 16:56
@delnan It might be better to describe those encodings as "obsolete". The way forward is Unicode, and Unicode doesn't fit in a 1-byte fixed-width encoding no matter what, but ISO 8859-1 *was* quite useful back when it was popular. – zwol, May 16 '13 at 14:48
@Zack When I say "are not desirable or useful", I am of course talking about the present day. I'm sure they made sense back when they were created, but today, they are obsolete (as you say), precisely because there is no reason to use them any more. – delnan, May 16 '13 at 16:33
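
The temporary-variable swap that several comments recommend could look roughly like the sketch below. The function name `reverse_bytes` is made up for illustration. It still reverses byte by byte, so it is only meaningful for single-byte encodings; it merely avoids the arithmetic on `char` values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reverse the bytes of a NUL-terminated string in place.
   `reverse_bytes` is a hypothetical helper name. It works byte by byte,
   so the result is only sensible for single-byte encodings. */
void reverse_bytes(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    for (size_t i = 0; i < len / 2; i++) {
        char tmp = s[i];            /* plain temporary: no arithmetic on char */
        s[i] = s[len - 1 - i];
        s[len - 1 - i] = tmp;
    }
}
```

Called on the question's buffer after `strcpy(a, "Particl")`, this leaves `a` holding `"lcitraP"`.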

zwol · Accepted Answer · 2013-05-14T15:52:39.717

9

This will not work with any encoding in which some (not necessarily all) codepoints require more than one char unit to represent, because you are reversing byte-by-byte instead of codepoint-by-codepoint. For the usual 8-bit char this includes all encodings that can represent all of Unicode.

For example: in UTF-16BE, the string "hello" maps to the byte sequence 00 68 00 65 00 6c 00 6c 00 6f. Your algorithm applied to this byte sequence will produce the sequence 6f 00 6c 00 6c 00 65 00 68 00, which is the UTF-16BE encoding of the string "漀氀氀攀栀".
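
The UTF-16BE example can be reproduced with a small sketch. Both function names are hypothetical, and the buffers take an explicit length because `strlen` would stop at the very first `00` byte of this data. `reverse_buffer` does what the question's loop does (with a temporary instead of arithmetic), while `reverse_units16` swaps whole 2-byte code units, assuming the length is even.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-by-byte reversal of `len` bytes, in place (hypothetical helper). */
void reverse_buffer(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len / 2; i++) {
        unsigned char tmp = buf[i];
        buf[i] = buf[len - 1 - i];
        buf[len - 1 - i] = tmp;
    }
}

/* Reversal of 16-bit code units, in place: each byte pair moves as a whole.
   Assumes `len` is even (hypothetical helper). */
void reverse_units16(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    assert(len % 2 == 0);
    size_t units = len / 2;
    for (size_t i = 0; i < units / 2; i++) {
        size_t a = 2 * i;
        size_t b = 2 * (units - 1 - i);
        unsigned char hi = buf[a], lo = buf[a + 1];
        buf[a] = buf[b];
        buf[a + 1] = buf[b + 1];
        buf[b] = hi;
        buf[b + 1] = lo;
    }
}
```

Applied to `00 68 00 65 00 6c 00 6c 00 6f`, `reverse_buffer` yields `6f 00 6c 00 6c 00 65 00 68 00`, whereas `reverse_units16` yields `00 6f 00 6c 00 6c 00 65 00 68`. Reversing code units is still not enough in general, since a surrogate pair would end up in the wrong order.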

It gets worse -- doing a codepoint-by-codepoint reversal of a Unicode string still won't produce the correct results in all cases, because Unicode has many codepoints that act on their surroundings rather than standing alone as characters. As a trivial example, codepoint-reversing the string "Spın̈al Tap", which contains U+0308 COMBINING DIAERESIS, will produce "paT länıpS" -- see how the diaeresis has migrated from the N to the A? The consequences of codepoint-by-codepoint reversal on a string containing bidirectional overrides or conjoining jamo would be even more dire.
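
To see the combining-mark problem concretely, here is one possible codepoint-by-codepoint reversal for UTF-8. This is a sketch under the assumption that the input is valid, NUL-terminated UTF-8; the names `reverse_range` and `reverse_utf8_codepoints` are invented for illustration. It first reverses the bytes inside each code point, then reverses the whole buffer, which puts each code point's bytes back in order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reverse the bytes in the half-open range [lo, hi). */
static void reverse_range(char *s, size_t lo, size_t hi)
{
    while (lo + 1 < hi) {
        hi--;
        char tmp = s[lo];
        s[lo] = s[hi];
        s[hi] = tmp;
        lo++;
    }
}

/* Reverse a string code point by code point, in place.
   Assumes valid UTF-8; hypothetical helper name. */
void reverse_utf8_codepoints(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);

    /* Pass 1: reverse the bytes within each code point. */
    size_t start = 0;
    while (start < len) {
        size_t end = start + 1;
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        while (end < len && ((unsigned char)s[end] & 0xC0) == 0x80)
            end++;
        reverse_range(s, start, end);
        start = end;
    }

    /* Pass 2: reverse everything; each code point's bytes are restored. */
    reverse_range(s, 0, len);
}
```

With the input `"Sp\xC4\xB1n\xCC\x88" "al Tap"` (the literal is split so that `\x88` is not read together with the following `a`), the result is `"paT la\xCC\x88n\xC4\xB1pS"`: the bytes of U+0308 now follow the `a` instead of the `n`, which is the migration described above.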

edited May 14 '13 at 15:52

answered May 14 '13 at 15:32


Thanks for the example! Now, I got something in my head. – Suvarna Pattayil May 14 '13 at 15:39
Why do you say I am doing a **byte-by-byte** reversal rather than a codepoint-by-codepoint one? `a[x]` will be of type `char` (of whatever size that is, 2 bytes or 1 byte ...). I am exchanging one `a[x]` with another `a[x]` (assuming no overflows) using, in the general sense, `a = a+b; b = a-b; a = a - b;`. Won't it swap the whole `h` with `o` [`00 68` with `00 6f`]? Any links or explanations that can help me understand are welcome. – Suvarna Pattayil May 15 '13 at 14:20
In C, except on very unusual ABIs, a `char` is *one 8-bit byte*. It is *not* one codepoint (as IIRC it is in Java?). Therefore, in any encoding that requires more than one *byte* to represent at least some codepoints, those codepoints will occupy more than one `char` in the string and your code won't work. – zwol May 15 '13 at 15:34
Java `char` is a UTF-16 code **unit**, i.e. a 16-bit value. That is not a Unicode code **point**. Characters outside the BMP take two Java `char`s in a Java `String`. – delnan May 15 '13 at 16:57
@delnan Thanks for clarification; I've never actually learned Java myself. – zwol May 15 '13 at 17:02
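
The point made in this thread, that a C `char` is a byte and not a code point, can be checked with a short sketch. It assumes valid UTF-8 input, and both function names are hypothetical. `strlen` counts `char`s (bytes), while the helper counts only bytes that start a code point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a valid UTF-8 string by skipping
   continuation bytes (10xxxxxx). Hypothetical helper name. */
size_t utf8_codepoint_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Number of bytes beyond one per code point; 0 means every
   code point fits in a single char. Hypothetical helper name. */
size_t utf8_extra_bytes(const char *s)
{
    return strlen(s) - utf8_codepoint_count(s);
}
```

For `"h\xC3\xA9llo"` (the word "héllo" in UTF-8), `strlen` reports 6 while `utf8_codepoint_count` reports 5. Whenever `utf8_extra_bytes` is nonzero, a `char`-by-`char` reversal will split at least one code point.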
