
In the project I'm working on, I deal with quite a few string manipulations; strings are read from binary files along with their encoding (which can be single- or double-byte). Essentially, I read the string value as a `vector<char>`, read the encoding, and then convert all strings to `wstring` for consistency.
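Roughly, the reading side looks something like this (simplified; the exact field layout here is made up for illustration):

```cpp
#include <cstdint>
#include <istream>
#include <string>
#include <vector>

// Simplified illustration only: read a length-prefixed string plus an
// encoding flag, and widen the result to wstring. The real file layout
// differs; this just shows the vector<char> -> wstring step.
std::wstring read_string(std::istream& in)
{
    std::uint8_t  encoding = 0;   // 1 = single byte, 2 = double byte
    std::uint32_t length   = 0;   // number of characters
    in.read(reinterpret_cast<char*>(&encoding), sizeof encoding);
    in.read(reinterpret_cast<char*>(&length), sizeof length);

    std::vector<char> raw(length * encoding);
    in.read(raw.data(), static_cast<std::streamsize>(raw.size()));

    std::wstring result;
    result.reserve(length);
    for (std::uint32_t i = 0; i < length; ++i) {
        if (encoding == 1) {
            result += static_cast<wchar_t>(static_cast<unsigned char>(raw[i]));
        } else {
            // double-byte characters stored little-endian in this sketch
            result += static_cast<wchar_t>(
                static_cast<unsigned char>(raw[2 * i]) |
                (static_cast<unsigned char>(raw[2 * i + 1]) << 8));
        }
    }
    return result;
}
```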

This works reasonably well; however, the filenames themselves can contain double-byte characters. I'm totally stumped on how to actually open the input stream. In C I would use the `_wfopen` function, passing a `wchar_t*` path, but `wifstream` seems to serve a different purpose: it is designed for reading double-byte characters from a file, not for reading single bytes from a file whose name contains double-byte characters.

What is the solution to this problem?

Edit: Searching the net, it looks like there's no support for this at all in standard C++ (e.g. see this discussion). However, I'm wondering whether C++11 actually adds anything useful in this area.
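Edit 2: For example, would something along these lines be the right C++11 approach: convert the `wstring` to UTF-8 with `std::wstring_convert` and open a plain `std::ifstream` with the resulting `std::string`? An untested sketch of what I have in mind (the filename is just an example value; this assumes a unix-like system where a filename is simply a byte string):

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11)
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Wide filename as I would have it after reading the binary file
    // (example value: "données.bin").
    std::wstring wname = L"donn\u00e9es.bin";

    // Convert the wchar_t string (UTF-32 on Linux/gcc) to a UTF-8 byte string.
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    std::string utf8_name = conv.to_bytes(wname);

    // On Linux the filename is just a byte string, so a plain ifstream
    // opened with the UTF-8 name should find the file, provided it exists
    // on disk with exactly those bytes in its name.
    std::ifstream in(utf8_name.c_str(), std::ios::binary);
    if (!in) {
        std::cerr << "could not open " << utf8_name << "\n";
        return 1;
    }
    // ... read the single-byte data from `in` as before ...
    return 0;
}
```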

Aleks G
  • I would avoid using `wchar_t` and `wstring` because `wchar_t` is not portable across compilers (it's 16 bits in VC++ but 32 bits in gcc). C++11 introduces `char16_t` and `char32_t` though obviously you can `typedef` them yourself. – Matthieu M. Jan 04 '13 at 13:31
  • @Matthieu M. I'm not too worried about VC++, as it's not one of my target compilers, anyway. I need to get the code working on unix-based systems first. – Aleks G Jan 04 '13 at 13:34
  • Here is the same question but for windows only: [How to open an std::fstream (ofstream or ifstream) with a unicode filename?](http://stackoverflow.com/q/821873/33499) – wimh Jan 04 '13 at 13:35
  • In Unix systems there is no point in using anything other than UTF-8 internally at all. In particular on Linux you can just pass a UTF-8 string to `open` directly. – filmor Jan 04 '13 at 13:35
  • @filmor Ok, point taken. I haven't dealt with utf-8 strings in c++ up until now, only working with `wstring`. Should I implement a subclass of `string`, something like `utf8string` to wrap all the conversion? Or is there an easier way? – Aleks G Jan 04 '13 at 13:38
  • No. Just use `std::string` with UTF-8 data; there are a lot of functions around that convert from and to UTF-8 (also here on stackoverflow). You should treat the data as `const`, though. An implementation with (IMHO) a good interface is http://utfcpp.sourceforge.net/, though I haven't used it so far. – filmor Jan 04 '13 at 13:43
  • If the filename is really UTF-16, then passing UTF-8 to `open` won't find it. The real question is where these files come from; it is impossible to create UTF-16 encoded filenames under Unix. – James Kanze Jan 04 '13 at 14:01
  • Well, of course you'd first convert UTF-16 to UTF-8 and then pass the result to `open`… – filmor Jan 04 '13 at 15:00
  • C++11 *IS* standard C++. C++03 is replaced, cancelled, withdrawn and no longer a standard. – MSalters Jan 04 '13 at 16:32

1 Answer


How the string you pass to open is mapped to a filename is implementation-dependent. In a Unix environment, it is passed almost literally: only '/' and '\0' are treated specially. In other environments, other rules apply, and I've had problems in the past because I'd written a file under Unix and then couldn't do anything with it under Windows (which treats a ':' in the filename specially).
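For illustration, a minimal sketch (the filenames are made up): on Linux, any byte sequence other than '/' and '\0' is accepted in a name component, whether or not the bytes form valid UTF-8:

```cpp
#include <fcntl.h>    // ::open, O_CREAT, O_WRONLY
#include <unistd.h>   // ::close
#include <cstdio>     // std::perror
#include <string>

int main()
{
    // Two byte strings naming the "same" file: one valid UTF-8, one Latin-1.
    // The kernel accepts both; it only rejects '/' and '\0' inside a name
    // component. Which one "looks right" in ls depends entirely on the
    // locale of whoever lists the directory.
    std::string utf8_name   = "caf\xC3\xA9.txt";   // "café.txt" as UTF-8
    std::string latin1_name = "caf\xE9.txt";       // "café.txt" as Latin-1

    for (const std::string& name : { utf8_name, latin1_name }) {
        int fd = ::open(name.c_str(), O_WRONLY | O_CREAT, 0644);
        if (fd == -1)
            std::perror(name.c_str());
        else
            ::close(fd);   // file created with exactly those name bytes
    }
    return 0;
}
```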

Another question is where these files come from. As mentioned above, there may be absolutely no way of opening them on your system: a filename with a ':' simply cannot be opened under Windows. Under Unix, if you end up with '\0' characters in the filename itself, you probably can't open the files either, and UTF-16 filenames will appear to contain '\0' characters under Unix. Your only solution may be to use native tools on the system which generated the files to rename them.

It's less clear to me how you could get such filenames onto a Unix disk in the first place. How does an SMB server such as Samba map UTF-16 filenames when it is serving a Windows box? Or an NFS server (I think such things also exist under Windows)?

James Kanze
  • In Linux the mapping of filenames to UTF-8 (the standard codepage) is done by the driver, which can often be configured (e.g. for cifs (smb) via the mount option `iocharset`). – filmor Jan 04 '13 at 14:59
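    (For example, a hypothetical mount invocation using that option: `mount -t cifs //server/share /mnt/win -o iocharset=utf8`.)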