
I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16? e.g. in Japanese versions)

I believe I can read them using raw reads and then convert with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Is there a better way?

Update: the correct answer and others below helped point me to libiconv. Here's the function I'm using to do the conversion. I currently have it inside a class that makes the conversion a one-line operation.

// Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
// It will allocate the space needed for dest. The caller is
// responsible for freeing the memory.
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

static int iwcstombs_alloc(char **dest, const wchar_t *src)
{
  iconv_t cd;
  const char from[] = "UTF-16LE";
  const char to[] = "UTF-8";

  cd = iconv_open(to, from);
  if (cd == (iconv_t)-1)
  {
    printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
           to, from, strerror(errno));
    return -1;
  }

  // How much space do we need?
  // Guess that we need the same amount of space as used by src.
  // (Always enough with 4-byte wchar_t; with 2-byte wchar_t a BMP
  //  character can need 3 UTF-8 bytes, so the guess can fall short.)
  // TODO: There should be a while loop around this whole process
  //       that detects insufficient space (errno == E2BIG) and
  //       reallocates more space.
  size_t len = sizeof(wchar_t) * (wcslen(src) + 1);

  // Allocate space
  size_t destLen = len;
  *dest = (char *)malloc(destLen);
  if (*dest == NULL)
  {
    iconv_close(cd);
    return -1;
  }

  // Convert
  size_t inBufBytesLeft = len;
  char *inBuf = (char *)src;
  size_t outBufBytesLeft = destLen;
  char *outBuf = *dest;

  // Note: iconv() returns (size_t)-1 on error, not an int.
  size_t rc = iconv(cd,
                    &inBuf,
                    &inBufBytesLeft,
                    &outBuf,
                    &outBufBytesLeft);
  if (rc == (size_t)-1)
  {
    printf("iconv() failed: %s\n", strerror(errno));
    iconv_close(cd);
    free(*dest);
    *dest = NULL;
    return -1;
  }

  iconv_close(cd);

  return 0;
} // iwcstombs_alloc()
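
For completeness, a usage sketch (someWideString is a placeholder for whatever UTF-16LE data you've read in):

char *utf8 = NULL;
if (iwcstombs_alloc(&utf8, someWideString) == 0)
{
  printf("%s\n", utf8);
  free(utf8);  // the caller owns the buffer
}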
Harvey

4 Answers


The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:

iconv -f utf16 -t utf8 file_in.txt -o file_out.txt

You can also use iconv(3) (see man 3 iconv) to convert strings in C. Most other languages have bindings to iconv as well.

Then you can use any UTF-8 locale, such as en_US.UTF-8, which is usually the default on most Linux distros.
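
For the C route, here's a minimal sketch of a streaming conversion with iconv(3). The filename and buffer sizes are just examples, and error handling is abbreviated:

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
  // "UTF-16" honors a BOM (and glibc assumes big-endian without one);
  // use "UTF-16LE" if you know the byte order, as with raw Windows data.
  iconv_t cd = iconv_open("UTF-8", "UTF-16");
  if (cd == (iconv_t)-1)
  {
    perror("iconv_open");
    return 1;
  }

  FILE *in = fopen("file_in.txt", "rb");  // example filename
  if (in == NULL)
  {
    perror("fopen");
    return 1;
  }

  char inBuf[4096], outBuf[8192];
  size_t carry = 0;  // bytes of an incomplete sequence kept from last pass

  for (;;)
  {
    size_t n = fread(inBuf + carry, 1, sizeof(inBuf) - carry, in);
    if (n == 0 && carry == 0)
      break;  // EOF and nothing left over

    char *inPtr = inBuf;
    size_t inLeft = carry + n;
    char *outPtr = outBuf;
    size_t outLeft = sizeof(outBuf);

    // EINVAL means the buffer ends mid-sequence; keep those bytes for
    // the next pass. outBuf is large enough that E2BIG cannot occur.
    if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft) == (size_t)-1
        && errno != EINVAL)
    {
      perror("iconv");
      break;
    }

    fwrite(outBuf, 1, sizeof(outBuf) - outLeft, stdout);
    memmove(inBuf, inPtr, inLeft);
    carry = inLeft;

    if (n == 0)
      break;  // EOF with a truncated trailing sequence
  }

  fclose(in);
  iconv_close(cd);
  return 0;
}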

  • I did not know about this tool. This doesn't answer my question b/c I need to read/write the files programmatically, but knowing about this tool makes for easier test case generation. Thanks. – Harvey Aug 10 '09 at 14:57
  • The version of iconv on my FreeBSD system wanted `UTF-16` and `UTF-8` instead of `utf16` or `utf8`. – Dan Pritts Dec 03 '13 at 20:33
  • View without saving: `iconv -f utf16 -t utf8 YOURFILE | less` – Luc Feb 12 '18 at 20:21

(Does Windows always use UTF-16? e.g. in Japanese versions)

Yes, NT's WCHAR is always UTF-16LE.

(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)

However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be; UTF-32 (UCS-4) is used there. So wcstombs_l is unlikely to be happy.
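
A trivial check you can run on any box:

#include <stddef.h>
#include <stdio.h>

int main(void)
{
  // Prints 4 on typical Linux/glibc builds, 2 on Windows.
  printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
  return 0;
}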

The Right Thing would be to use a library like iconv to read it into whichever format you are using internally - presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like the surrogates wrong.

Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Indeed, Linux can't use UTF-16 as a locale default encoding because of all the embedded \0 bytes.
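
If you do go the iconv route with wchar_t as the internal format, a minimal sketch might look like this. Note that "WCHAR_T" as an encoding name is a glibc/GNU-libiconv convenience, not portable everywhere, and the function name here is mine:

#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

// Sketch only: convert raw UTF-16LE bytes into the platform's wchar_t.
// Returns 0 on success, -1 on failure.
static int utf16le_to_wchar(wchar_t *dst, size_t dstChars,
                            const char *src, size_t srcBytes)
{
  iconv_t cd = iconv_open("WCHAR_T", "UTF-16LE");
  if (cd == (iconv_t)-1)
    return -1;

  char *inBuf = (char *)src;
  size_t inLeft = srcBytes;
  char *outBuf = (char *)dst;
  size_t outLeft = dstChars * sizeof(wchar_t);

  size_t rc = iconv(cd, &inBuf, &inLeft, &outBuf, &outLeft);
  iconv_close(cd);

  if (rc == (size_t)-1 || outLeft < sizeof(wchar_t))
    return -1;

  *(wchar_t *)outBuf = L'\0';  // terminate the converted string
  return 0;
}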

bobince
  • WCHAR in Windows seems to have a fixed size (you could do sizeof() on it). Doesn't that mean it only implements a subset of UTF-16, which is variable-width? – PolyThinker Feb 09 '09 at 09:21
  • 1
    It stores 16-bit values corresponding to UTF-16 code points; if you want characters outside the BMP you have to use the surrogates manually, Windows won't help you. eg. ''.length==2. This is the same situation as eg. Java, or Python in narrow-Unicode mode. – bobince Feb 09 '09 at 13:14
  • After lots of experiments and using the knowledge of this answer, I used libiconv. I'm adding the simple function I used here for others to use. It's not perfect and I encourage others to fix problems. – Harvey Aug 10 '09 at 15:00
  • @Harvey did you have any luck using libiconv on japanese (ShiftJIS) strings? the current sources seem to indicate it's not supported. – Ofek Shilon Jun 09 '17 at 07:00
  • Never got that far, sorry. – Harvey Jun 09 '17 at 14:16

You can read the file as binary and then do your own quick conversion: http://unicode.org/faq/utf_bom.html#utf16-3. But it is probably safer to use a library (like libiconv) that handles invalid sequences properly; see the sketch below.
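
For illustration, the surrogate arithmetic from that FAQ boils down to something like this sketch. It does no validation of unpaired surrogates, which is exactly why a library is safer:

#include <stddef.h>
#include <stdint.h>

// Decode one code point from a UTF-16 code-unit sequence, advancing *i.
uint32_t utf16_next(const uint16_t *units, size_t *i)
{
  uint16_t hi = units[(*i)++];
  if (hi >= 0xD800 && hi <= 0xDBFF)   // high (lead) surrogate
  {
    uint16_t lo = units[(*i)++];      // assumes a valid low surrogate follows
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (lo - 0xDC00));
  }
  return hi;                          // plain BMP code point
}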

Mihai Nita
  • Thanks for the hint. My boss was using those functions you pointed to, but we switched to libiconv since it makes handling different to/from encoding sets easy. – Harvey Aug 10 '09 at 15:11

I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in it such as en_US.UTF-8.
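
For example, with a UTF-8 locale active, the standard wcstombs does the wide-to-UTF-8 step. A sketch, assuming en_US.UTF-8 is installed on the machine:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  // Assumes this locale is installed; check the return value in real code.
  if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
    return 1;

  const wchar_t *wide = L"caf\u00e9";          // internal wide string
  char utf8[32];
  if (wcstombs(utf8, wide, sizeof(utf8)) == (size_t)-1)
    return 1;

  printf("%s\n", utf8);                        // prints "café" as UTF-8
  return 0;
}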

Adam Rosenfield
  • I didn't have much choice at first since my boss was the one who wrote the broken code. I've since helped him to see things differently and now we'll be using UTF-8 for all stored data. – Harvey Aug 10 '09 at 14:58