
I'm kind of new to this Unicode world, and I have no idea how to do this in C. I'm on a *nix system (Fedora Linux). I tried opening the UTF-8 file in binary mode, then reading each byte into an integer and converting it to the corresponding Unicode code point. But the thing is, how can I write this integer into a text file using the UTF-16 format?

The resultant UTF-16 output file must be identical to the UTF-8 file it just read, but in the UTF-16 format. Can anyone help me with that? Should I start by reading the UTF-8 file into an integer? Because I'm having trouble reading it otherwise. I know my code is a bit messy; I'm working on trying to make it better. Thanks in advance!

Noda De Caju

2 Answers


First you have to make sure you understand the difference between a character and a codepoint. On that subject I suggest you read this article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Then you can use the ConvertUTF library available here. A word of warning, though: this library no longer appears to be supported by unicode.org.

In your case you want to convert from UTF-8 to UTF-16, so you should use the function ConvertUTF8toUTF16, which takes an input buffer of UTF-8 (unsigned char) and produces an output buffer of UTF-16 (unsigned short).

So, down to your question: you should read your input UTF-8 file as a buffer of unsigned char and write to your output UTF-16 file as a buffer of unsigned short. Be mindful of the endianness.
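For illustration, here is a minimal sketch of what that call could look like, assuming the ConvertUTF.h header that ships with the library above; the wrapper function and buffer names are made up, and error handling is kept to a minimum:

#include <stddef.h>
#include "ConvertUTF.h"

/* Convert utf8_len bytes of UTF-8 into utf16_buf (capacity utf16_cap units).
   Returns the number of UTF-16 units written, or 0 on a conversion error. */
size_t utf8_buffer_to_utf16(const unsigned char *utf8_buf, size_t utf8_len,
                            unsigned short *utf16_buf, size_t utf16_cap)
{
    const UTF8 *src = (const UTF8 *)utf8_buf;
    UTF16 *dst = (UTF16 *)utf16_buf;
    ConversionResult res = ConvertUTF8toUTF16(&src, src + utf8_len,
                                              &dst, dst + utf16_cap,
                                              strictConversion);
    if (res != conversionOK)
        return 0;
    return (size_t)(dst - (UTF16 *)utf16_buf);
}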

A last piece of warning: in the Microsoft world, "Unicode" and UTF-16 are often equated, but in that context "Unicode" actually means UCS-2 most of the time.

plfoley
  • Please post your code to pastebin, and include a link in your question, so I can better understand what you are doing. – David Jeske Apr 21 '17 at 19:52

One solution is to use a Unicode library, such as ICU-TC, which will do the work for you.

To avoid the library dependency and convert yourself, you will want to read and convert from the variable-length UTF-8 encoding into 32-bit unsigned integers, and then convert the 32-bit integers to UTF-16's variable-length encoding of 16-bit values.
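As an illustration of the first half, here is a minimal sketch of decoding one UTF-8 sequence into a 32-bit code point. The function name is made up, and the sketch assumes well-formed input; a real decoder should also reject overlong forms and invalid continuation bytes.

#include <stdint.h>

/* Decode one UTF-8 sequence starting at s; store its byte length in *len
   and return the decoded code point. Assumes s points at a valid sequence. */
uint32_t decode_utf8(const unsigned char *s, int *len)
{
    if (s[0] < 0x80) {                         /* 0xxxxxxx: 1 byte */
        *len = 1;
        return s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {        /* 110xxxxx 10xxxxxx: 2 bytes */
        *len = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    } else if ((s[0] & 0xF0) == 0xE0) {        /* 1110xxxx ...: 3 bytes */
        *len = 3;
        return ((uint32_t)(s[0] & 0x0F) << 12) |
               ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    } else {                                   /* 11110xxx ...: 4 bytes */
        *len = 4;
        return ((uint32_t)(s[0] & 0x07) << 18) |
               ((uint32_t)(s[1] & 0x3F) << 12) |
               ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    }
}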

You will need to open your output file for binary writing, with:

FILE *outfile = fopen(filename,"wb");

UTF-16 may be written in either little- or big-endian byte order. To disambiguate, UTF-16 has a special byte-order mark, the code point 0xFEFF, which you write first. The order in which its two bytes appear in the file tells the reader which endianness the file was written in (see the explanation in the UTF-16 article on Wikipedia). The code:

unsigned short int byte_ordering_sentinel = 0xFEFF;
fwrite(&byte_ordering_sentinel, 2, 1, outfile);

For each 32-bit integer, you will need to follow the UTF-16 rules to produce one or two 16-bit code units. For each 16-bit UTF-16 value, you would do:

fwrite(&next_utf16_value, 2, 1, outfile);
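For illustration, here is a sketch of that encoding step (the helper name is made up; code points below U+10000 become a single unit, anything above becomes a surrogate pair):

#include <stdint.h>
#include <stdio.h>

/* Encode one code point as UTF-16 and write it to outfile.
   Code points in the surrogate range U+D800..U+DFFF are not valid input. */
void write_utf16(uint32_t cp, FILE *outfile)
{
    if (cp < 0x10000) {                        /* single 16-bit unit */
        unsigned short unit = (unsigned short)cp;
        fwrite(&unit, 2, 1, outfile);
    } else {                                   /* surrogate pair */
        uint32_t v = cp - 0x10000;
        unsigned short hi = (unsigned short)(0xD800 + (v >> 10));
        unsigned short lo = (unsigned short)(0xDC00 + (v & 0x3FF));
        fwrite(&hi, 2, 1, outfile);
        fwrite(&lo, 2, 1, outfile);
    }
}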

NOTE 1: Endianness is a product of your CPU and Operating System. Intel CPUs are always little endian. ARM CPUs can do either, and are little endian under Android. If you wish to change the endianness of the output, you need to byte-swap each 16-bit value before it's written. Be sure to also byte-swap the initial byte_ordering_sentinel.

On linux, you can byte swap efficiently using macros in byteswap.h.
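For example, a minimal sketch using glibc's bswap_16 macro, assuming you want big-endian output from a little-endian host:

#include <byteswap.h>   /* bswap_16 (glibc) */
#include <stdio.h>

/* Write one 16-bit UTF-16 unit in big-endian order from a little-endian host. */
void write_utf16_unit_be(unsigned short unit, FILE *outfile)
{
    unsigned short swapped = bswap_16(unit);   /* swap the two bytes */
    fwrite(&swapped, 2, 1, outfile);
}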

NOTE 2: When using fgetc() it is important to check for the EOF value, which is why the variable receiving the result must be an int rather than a char. There could also be a race condition between your feof(arq) check and your fgetc() call if someone changes the file while your program is running. Your loop could instead look like this:

while ( (num=fgetc(arq)) != EOF ) 
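Putting it together, a minimal sketch of the read loop (arq is assumed to be the input FILE*, as in the question's code):

int num;                                  /* int, not char, so EOF fits */
while ((num = fgetc(arq)) != EOF) {
    unsigned char byte = (unsigned char)num;
    /* feed 'byte' into the UTF-8 decoder sketched above */
}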
David Jeske
  • I'm trying to do it all manually, without this library. So far I'm reading the UTF-8 file into an array of 4 unsigned characters. But my question is, how can I manipulate this array to get the Unicode code point and then write the UTF-16 file with it? – Noda De Caju Apr 21 '17 at 18:59
  • Are you aware that UTF-8 characters are variable length? You need to "decode" the UTF-8 characters into a 32-bit unsigned integer, using the rules of UTF-8 encoding (see https://en.wikipedia.org/wiki/UTF-8). That will give you the proper "code point" number. Once you have the code point in a 32-bit unsigned integer, you'll need to "encode" the integer in UTF-16, using the UTF-16 encoding rules (see https://en.wikipedia.org/wiki/UTF-16). – David Jeske Apr 21 '17 at 19:54
  • All right, I modified my code. Now I have the Unicode code point in an unsigned integer. From that, I'm going to transform it into a UTF-16 format integer. But how can I write it to an output .txt file in binary mode? – Noda De Caju Apr 21 '17 at 20:02
  • Thank you! One last question: I'm using the Linux gcc compiler, but every time I compile my code the output files end up in little endian. I'm using a 64-bit Intel machine. Is there a way to tell the compiler to use big endian instead of changing everything into little endian? Aside from switching every byte order to counter the compiler, how can I avoid that? – Noda De Caju Apr 21 '17 at 22:15
  • Endianness is a product of the CPU and Operating System you are using, not the compiler. Intel CPUs are little endian. ARM CPUs can do both, and are little-endian under Android. If you wish to write the file as big-endian on an Intel CPU, you'll need to byte-swap every 16-bit value before you write it out. – David Jeske Apr 21 '17 at 22:58