
I've got a tedious six months to a year ahead of me. I'm working on a program with over a million lines of code (much of it written in the early-to-mid '90s), and it has been decided that it should now support a UNICODE build. I've researched the topic and found many of the best practices:

  • using the _t versions of many Microsoft and C++ functions, e.g. _stprintf_s() instead of sprintf_s(), or _tcsstr() instead of strstr(),
  • wrapping all string literals that need to be TCHAR* in _T(), like so: _T("string") or _T('c'),
  • replacing most char* with LPTSTR, most const char* with LPCTSTR, and char with TCHAR, using CA2T() and CT2A() to convert between char* and LPTSTR where necessary.

I was wondering if anyone has written a script capable of automatically making many of these changes, since it could save me MONTHS of work.

Macke
Alex Londeree
  • I think this will help: http://mihai-nita.net/2007/12/19/tounicode-automating-some-of-the-steps-of-unicode-code-conversion-windows/ – chris Jun 14 '12 at 16:41
  • If it is a real upgrade, and doesn't need to be multi-byte anymore, you should skip all the `_t` stuff and go directly to `wchar_t`. `_t` and `_T` were designed as a (temporary) aid some 15 years ago. – Bo Persson Jun 14 '12 at 18:15
  • But is there a reason not to use `_t` and `_T`? As far as I can tell, `_T("")` is still required for setting `wchar_t` strings, and using `LPWSTR` vs `LPTSTR` doesn't seem to take any extra effort. – Alex Londeree Jun 14 '12 at 21:56
  • @Alex: No, `wchar_t` strings use `L"..."` literals. – dan04 Jun 14 '12 at 23:14
  • `_T("")` maps to `L""` when `_UNICODE` is defined. The only reason to use `TCHAR` and related functions instead of `wchar_t` and related functions is if you need to produce both ANSI and UNICODE builds from the same source code. If you need to maintain ANSI support, use `TCHAR` and related. If you are going full UNICODE only, use `wchar_t` and related. Better still, use a Unicode framework such as ICONV or ICU, as Unicode is hard to get right. It is not enough to just change data types; sometimes you have to change program logic to account for logical differences in how ANSI and UNICODE work. – Remy Lebeau Jun 14 '12 at 23:21
  • Yes, it's easy to change your Windows API calls to use UTF-16. But are you going to change your file formats to use UTF-16? And if you rely on any non-Microsoft libraries, do they **all** support UTF-16? What if they're as behind on Unicode support as your product is? Zlib, for example, didn't support `wchar_t*` filenames until [a StackOverflow user requested it](http://stackoverflow.com/questions/9717068/using-zlib-with-unicode-file-paths-on-windows) 3 months ago. – dan04 Jun 15 '12 at 02:18
  • On a similar note, OpenSSL still doesn't support Unicode filenames at all on Windows. Other platforms use Ansi or UTF-8 filesystems, so OpenSSL handles them OK with its use of `char*`-based filenames. But on Windows, the open-source [Indy library](http://www.indyproject.org) (which I work on) ended up having to write its own set of functions that are basically copies of OpenSSL's code, adjusted to use `wchar_t*`-based filenames to support UTF-16. – Remy Lebeau Jun 16 '12 at 19:07
  • @RemyLebeau: "Other platforms use Ansi" No. ANSI is a purely Microsoft term for its set of proprietary code pages that were supposed to be standardized by ANSI, but ISO was first to standardize similar (though not identical) code pages. So other platforms use ISO. – Yakov Galka Jun 17 '12 at 06:33
  • Either way, `char*` can handle ISO/ANSI encodings, so OpenSSL happily supports those encodings, as it just passes the `char*` values as-is to platform APIs that also take `char*` values. On Windows, that eliminates any possibility of supporting UTF-16. That is the point I was trying to make. – Remy Lebeau Jun 17 '12 at 09:58

1 Answer


I think this approach exactly fits your scenario.

Leave all your strings as narrow chars, keep using sprintf and strstr as before, read and write text files that are always assumed to be UTF-8 without BOMs, and so on. All you need to change is your communication with the system: treat your strings as UTF-8, and convert them to UTF-16 on the fly just before calling into MFC or Windows.

As a bonus, you'll get easier portability to non-Windows platforms, compared to the approach advocated by Microsoft.

Yakov Galka