If I have a UTF-8 encoded string in C (basically a `char`, or `unsigned char`?, array) and I want to write it to and read it from a file (say in binary mode), is there anything different I need to do with it, compared to if I were writing/reading just ASCII characters?
-
What OS? On a *nix operating system it's no different, on Windows... that's a whole different thing – Mgetz Dec 03 '13 at 21:21
-
If you are just reading and writing without actually processing it in any way, there is nothing special - if you just treat it as binary data. – user2802841 Dec 03 '13 at 21:22
-
@Mgetz - Why do you think Windows is any different for this? – Roddy Dec 03 '13 at 21:26
-
@Mgetz: why? The only difference is in handling line ends, and those are by definition untouched by UTF8. – Jongware Dec 03 '13 at 21:27
-
What about processing? Is it any big deal? I think it should be no problem to do string copying or getting the number of bytes... probably I won't need to do other manipulations on the string later – Dec 03 '13 at 21:27
-
@Jongware actually the difference is that Windows doesn't understand UTF-8 and assumes everything is in the local code page, so it's a HUGE deal – Mgetz Dec 03 '13 at 21:27
-
No, this is not Windows. P.S. It seems I will need an unsigned char array for the UTF-8 string, right? – Dec 03 '13 at 21:28
-
@Mgetz - Why do you think Windows itself cares about what you store in your files? Some apps that run on Windows can open text files (eg Notepad), and most allow you to specify the Encoding at the time you open the file. But none of that impacts how you *write* the file, AFAIK. – Roddy Dec 03 '13 at 21:32
-
Doesn't matter if signed or unsigned, you're not performing calculations with it. – Petr Skocik Dec 03 '13 at 21:34
-
Thor, I thought unsigned because some bytes in UTF-8 could be more than 127. Was I wrong? – Dec 03 '13 at 21:35
-
That doesn't matter, you can also test if they are `< 0`. – Jongware Dec 03 '13 at 21:36
-
@user2568508 - Yes, but the binary representation of signed and unsigned char in a file is identical. signed vs unsigned is just a matter of interpretation, not representation. – Roddy Dec 03 '13 at 21:36
-
I am confused a bit. Say I have value 210 decimal. How does that fit inside one char value? – Dec 03 '13 at 21:37
-
The usual way, of course! No matter if it's unsigned or signed, you are guaranteed to have 8 freely assignable bits per char. – Jongware Dec 03 '13 at 21:39
-
Jongware, I don't know, maybe I am missing something? I thought the char range was -128 to 127 (http://stackoverflow.com/questions/3898688/range-of-signed-char) – Dec 03 '13 at 21:42
-
Yes, but the sign bit is also under your control... As Roddy says, it's just a matter of interpretation. For math it is of *profound* importance, for text it's more whatever is convenient for you. – Jongware Dec 03 '13 at 21:44
-
@user2568508 210 is 11010010 as an unsigned char. The same binary representation 11010010 is -46 as a *signed* char. A byte has 8 bits regardless, you just have different ways of doing sums on it. – Roddy Dec 03 '13 at 21:46
-
I think I am not following you, I thought if I had values more than 127 I could not fit them in signed char. – Dec 03 '13 at 21:46
-
210 in binary is 11010010. If you interpret that as an unsigned char, it works like it normally does in math: 0*1 + 1*2 + 0*4 + ... . If you interpret it as a signed char, the leading bit means it's a negative number and the interpretation follows the two's complement convention (which effectively gives -46). – Petr Skocik Dec 03 '13 at 21:46
-
Maybe you should start another thread for that, if you're new to this. – Petr Skocik Dec 03 '13 at 21:48
-
OK, I see. I will probably just use an unsigned char array to store the string, or see whichever works fine – Dec 03 '13 at 21:49
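A minimal sketch of the 210 / 11010010 reinterpretation discussed in the comments above (an added illustration, not from the thread; the exact signed value assumes a two's-complement machine):

    #include <stdio.h>

    int main(void)
    {
        unsigned char u = 210;             /* bit pattern 11010010 */
        signed char   s = (signed char)u;  /* same bits, interpreted as signed */

        printf("unsigned: %u\n", (unsigned)u);  /* prints 210 */
        printf("signed:   %d\n", (int)s);       /* prints -46 on two's-complement machines */
        return 0;
    }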
3 Answers
Short answer: No, nothing different
Longer answer: As always, it depends...
It depends on what you're going to use to read the file afterwards. If it's some other application, you may need to give it a hint that the file is UTF-8 encoded text, by sticking a UTF-8 BOM at the front. However, this is typically discouraged, so you can usually revert to the short answer!
However, your comments imply you're interested in processing the char array, rather than simply reading/writing it. Yes, you may need to do things differently, depending entirely on what you want to do. Because a single 'Unicode character' can be encoded as multiple bytes in the array, for some operations (counting word lengths in text, for instance) you would need to be aware of the multi-byte characters. But because all the 'extra' bytes in UTF-8 have the high bit set, you're never going to get them mixed up with normal ASCII characters. So things like string search and replace typically work just as they do for normal ASCII.
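As a sketch of that byte-awareness (an added illustration; the name utf8_strlen is just an example and it assumes well-formed UTF-8): continuation bytes always match the 10xxxxxx pattern, so counting only the bytes that don't match gives the number of code points rather than the number of bytes.

    #include <stddef.h>

    /* Count code points, not bytes, by skipping UTF-8 continuation bytes. */
    size_t utf8_strlen(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80)  /* not a 10xxxxxx continuation byte */
                count++;
        }
        return count;
    }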

If you're just outputting it (no char counting or modifications), you shouldn't have to worry about it. On Linux with gcc, you can even put UTF-8 inside of strings in your source, and it works fine.
E.g.:
puts("řčšéíčšřáčéířáéíščřáéíčřáščéřáěéířěéčšě"); //Will work correctly on Linux
It's just that `č`, for example, won't be represented by a single char.
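For example, a small sketch (an added illustration; it assumes the source file and the terminal both use UTF-8):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *s = "č";            /* encoded as two bytes in UTF-8: 0xC4 0x8D */
        printf("%zu\n", strlen(s));     /* prints 2: strlen counts bytes, not characters */
        puts(s);                        /* still displays as a single character */
        return 0;
    }

strlen still works fine for buffer sizes and copying; it just counts bytes rather than displayed characters.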

-
What is the problem with modifications of the string? Say I read it into an unsigned char buffer. Later, I will modify it as needed (taking into account that it is UTF-8); would that be a problem? – Dec 03 '13 at 21:32
-
No, surely not. If you count on the fact that characters in UTF8 can be represented by various numbers of **chars**, and if you take the particular byte-by-byte representation of your string into account, you should be OK. – Petr Skocik Dec 03 '13 at 21:37
As long as you are not actually using the sign for math operations, you should be fine.
UTF8 expects at least 8 bits per character "unit", and C chars, signed or not, are guaranteed to have these. Nothing is different -- except, of course, when you have a habit of adding up "a" to "b" (a nonsense operation on text) or converting to and from integers (which is as okay as it is with "regular" ASCII text with occasional high ASCII characters, i.e., if you take care of conversions when they may happen, you should be fine).
With that out of the way: if you are planning to show your output, you might want to use the same type -- signed or unsigned -- as your output library.
If I have to output UTF8 to the screen console (OSX's Terminal window, which is fully capable of showing UTF8), I use regular `char` strings, so I can use the standard stdlib and string functions.
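To tie this back to the original question, a sketch of the round trip in binary mode (an added illustration; the file name "utf8.txt" is just a placeholder). The bytes go out and come back untouched, exactly as they would with plain ASCII:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "řeč";        /* a UTF-8 encoded literal */
        char buf[64] = {0};

        FILE *f = fopen("utf8.txt", "wb");
        if (f == NULL) return 1;
        fwrite(text, 1, strlen(text), f);   /* write the raw bytes */
        fclose(f);

        f = fopen("utf8.txt", "rb");
        if (f == NULL) return 1;
        size_t n = fread(buf, 1, sizeof buf - 1, f);  /* read them back unchanged */
        fclose(f);

        printf("read %zu bytes: %s\n", n, buf);
        return 0;
    }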

-
Thank you Jongware, I will not do math, just reading and writing - however, after I read it into a buffer, I may want to manipulate the string a bit – Dec 03 '13 at 21:36
-
There may be specialized UTF8-enabled string libraries but the system has been thought out *very* well. If you only need basic operations (strcat, strchr analogs), rolling your own is *very* instructive. – Jongware Dec 03 '13 at 21:38
-
Hi Jongware, yes, I may need really basic operations like strcpy/strlen, so I could roll them myself. It's just that I might need to use Unicode in my embedded C app and was thinking UTF-8 could be the proper encoding for that... – Dec 03 '13 at 21:40
-
Absolutely! .. Hm. Perhaps not if *most* of your text is going to need 16-bit characters (or even 32-bit, with the full Chinese Unicode set). Then you can save time by internally processing everything in a larger size, and only convert when reading or writing. – Jongware Dec 03 '13 at 21:42
-
Then convert to 16 bits on reading and back to UTF8 on writing. It saves you the whole counting-bytes hassle, but a drawback would be you'd need a 16-bit analog of all common string functions. – Jongware Dec 03 '13 at 21:46
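A rough sketch of that decode-on-read idea (an added illustration; the helper name utf8_decode_bmp is hypothetical, no validation is done, and only 1- to 3-byte sequences, i.e. values that fit in 16 bits, are handled; real code would also need to reject malformed input and handle 4-byte sequences):

    #include <stdint.h>

    /* Decode one UTF-8 sequence of 1-3 bytes into a 16-bit value.
       Returns the number of bytes consumed. No validation is performed. */
    static int utf8_decode_bmp(const unsigned char *s, uint16_t *out)
    {
        if (s[0] < 0x80) {                        /* 0xxxxxxx: plain ASCII */
            *out = s[0];
            return 1;
        } else if ((s[0] & 0xE0) == 0xC0) {       /* 110xxxxx 10xxxxxx */
            *out = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
            return 2;
        } else {                                  /* 1110xxxx 10xxxxxx 10xxxxxx */
            *out = (uint16_t)(((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
            return 3;
        }
    }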