11

How can I convert a wchar_t ('9') to a digit in the form of an int (9)?

I have the following code where I check whether or not peek is a digit:

if (iswdigit(peek)) {
    // store peek as numeric
}

Can I just subtract '0', or are there some Unicode specifics I should worry about?

Lasse Espeholt

5 Answers

6

Look into the atoi class of functions: http://msdn.microsoft.com/en-us/library/hc25t012(v=vs.71).aspx

In particular, _wtoi(const wchar_t *string) seems to be what you're looking for. You would have to make sure your wchar_t string is properly null-terminated, though, so try something like this:

if (iswdigit(peek)) {
    // store peek as numeric
    wchar_t s[2];
    s[0] = peek;
    s[1] = 0;
    int numeric_peek = _wtoi(s);
}
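
`_wtoi` is a Microsoft-specific function. Where it is not available, the same idea can be sketched with the standard `std::wcstol` from `<cwchar>`. The helper name `digitViaWcstol` is made up for illustration, and the sketch assumes the program is running in the "C" locale, where only the digits 0 through 9 are parsed.

```cpp
#include <cwchar>

// Hypothetical helper (not part of any library): converts a single wide
// character to its numeric value by going through a two-element,
// null-terminated buffer, as in the _wtoi example.
// Returns -1 if wcstol could not parse the character as a base-10 digit.
int digitViaWcstol(wchar_t peek)
{
    wchar_t s[2] = { peek, L'\0' };
    wchar_t* end = nullptr;
    long value = std::wcstol(s, &end, 10);
    // If no conversion was performed, wcstol leaves end pointing at s.
    return (end == s) ? -1 : static_cast<int>(value);
}
```

Unlike `_wtoi`, this lets the caller tell a parsed `0` apart from a character that is not a digit at all.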
Lasse Espeholt
Daren Thomas
  • I have seen those, but it seems a little stupid to convert it to a `string`, and after that, convert it to an `int`. But if that is the usual way to do it, I guess I'll do that :) – Lasse Espeholt May 20 '11 at 07:39
  • True, but do you really want to duplicate this kind of logic? You would have to be sure you know all there is to know about Unicode. Or at least enough to be sure you're not messing up. I personally wouldn't risk it. – Daren Thomas May 20 '11 at 07:41
  • I won't either. I just thought there was a method to do it. I see the boost library does it. +1 – Lasse Espeholt May 20 '11 at 07:43
  • `boost::lexical_cast` just passes the problem on to iostreams, and iostreams don't know anything about Unicode. So the logic he would not be duplicating is probably broken with respect to what he wants to do. – James Kanze May 20 '11 at 09:09
6

If the question concerns just '9' (or one of the Roman digits), just subtracting '0' is the correct solution. If you're concerned with anything for which iswdigit returns non-zero, however, the issue may be far more complex. The standard says that iswdigit returns a non-zero value if its argument is "a decimal digit wide-character code [in the current locale]". This is vague, and leaves it up to the locale to define exactly what is meant. In the "C" locale or the "Posix" locale, the "Posix" standard, at least, guarantees that only the Roman digits zero through nine are considered decimal digits (if I understand it correctly), so if you're in the "C" or "Posix" locale, just subtracting '0' should work.

Presumably, in a Unicode locale, this would be any character which has the general category Nd. There are a number of these. The safest solution would be simply to create something like (variables here with static lifetime):

#include <algorithm>    // std::find
#include <iterator>     // std::begin, std::end

wchar_t const* const digitTables[] =
{
    L"0123456789",
    L"\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    // ...
};

//!     \return
//!         wch as a numeric digit, or -1 if it is not a digit
int asNumeric( wchar_t wch )
{
    int result = -1;
    for ( wchar_t const* const* p = std::begin( digitTables );
            p != std::end( digitTables ) && result == -1;
            ++ p ) {
        wchar_t const* q = std::find( *p, *p + 10, wch );
        if ( q != *p + 10 ) {
            result = static_cast<int>( q - *p );
        }
    }
    return result;
}

If you go this way:

  1. you'll definitely want to download the UnicodeData.txt file from the Unicode consortium ("Unicode Character Database"; this page has links to both the Unicode data file and an explanation of the encodings used in it), and
  2. possibly write a simple parser of this file to extract the information automatically (e.g. when there is a new version of Unicode)—the file is designed for simple programmatic parsing.
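
A parser along those lines could look roughly like the sketch below. It relies on the UnicodeData.txt layout of semicolon-separated fields, where field 0 is the code point in hex, field 2 is the general category, and field 6 is the decimal digit value. The function names `splitFields` and `readDecimalDigits` are invented for this example, and error handling for malformed input is left out.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits one UnicodeData.txt line into its semicolon-separated fields.
std::vector<std::string> splitFields(const std::string& line)
{
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ';')) {
        fields.push_back(field);
    }
    return fields;
}

// Reads UnicodeData.txt-style input and returns a map from code point to
// decimal digit value for every character whose general category is Nd.
std::map<unsigned long, int> readDecimalDigits(std::istream& in)
{
    std::map<unsigned long, int> digits;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> f = splitFields(line);
        if (f.size() > 6 && f[2] == "Nd" && !f[6].empty()) {
            digits[std::stoul(f[0], nullptr, 16)] = std::stoi(f[6]);
        }
    }
    return digits;
}
```

The resulting map could either be consulted directly at run time or used to generate the `digitTables` array whenever a new version of Unicode comes out.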

Finally, note that solutions based on ostringstream and istringstream (this includes boost::lexical_cast) will not work, since the conversions used in streams are defined to only use the Roman digits. (On the other hand, it might be reasonable to restrict your code to just the Roman digits. In which case, the test becomes if ( wch >= L'0' && wch <= L'9' ), and the conversion is done by simply subtracting L'0', always supposing that the native encoding of wide character constants in your compiler is Unicode (the case, I'm pretty sure, of both VC++ and g++). Or just ensure that the locale is "C" (or "Posix", on a Unix machine).)
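
For the restricted case, the range test and the subtraction fit in one small function. This is only a sketch of the approach described above; the name `asciiDigitValue` is hypothetical, and it rests on the same assumption about the compiler's wide character encoding.

```cpp
// Hypothetical helper: returns the value of wch if it is one of
// L'0'..L'9', and -1 for every other character, including digits
// from other scripts such as U+0669.
int asciiDigitValue(wchar_t wch)
{
    return (wch >= L'0' && wch <= L'9') ? static_cast<int>(wch - L'0') : -1;
}
```

Because the range check replaces `iswdigit`, the result no longer depends on the current locale.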

EDIT: I forgot to mention: if you're doing any serious Unicode programming, you should look into ICU. Handling Unicode correctly is extremely non-trivial, and they've a lot of functionality already implemented.

James Kanze
  • +1 Thanks for the comprehensive answer :) I ain't doing any serious Unicode programming. I just wanted to make it Unicode-aware, but I guess it's a very difficult task to do properly. – Lasse Espeholt May 20 '11 at 08:59
  • 2
    It depends how Unicode-aware you want to be. C++ and Java are officially Unicode-aware, but they still require numeric constants to be in Roman numbers; their Unicode awareness is limited to allowing Unicode characters in symbols and in string and character literals (and comments). I think that for a lot of programs, something like that is sufficient Unicode awareness. – James Kanze May 20 '11 at 09:12
2

You could use boost::lexical_cast:

const wchar_t c = L'9';
int n = boost::lexical_cast<int>( c );
Kirill V. Lyadvinsky
  • 2
    This is outrageous overkill. Behind the scenes, you're creating an `std::ostringstream` to convert the `wchar_t` into an `std::string`, then an `std::istringstream` to convert the `std::string` into an int, when all that is needed is a simple subtraction. – James Kanze May 20 '11 at 07:53
  • I would simply use if (peek >= L'0' && peek <= L'9') – Kirill Kovalenko May 20 '11 at 07:57
  • @James Kanze, if this is not a time-critical part of the code, I would write code that is easier to read rather than code that will, in theory, run a bit faster. Besides, [you can specialize](http://stackoverflow.com/questions/1250795/very-poor-boostlexical-cast-performance/1251043#1251043) `boost::lexical_cast` for single `wchar_t` to make it work incredibly fast without using streams. – Kirill V. Lyadvinsky May 20 '11 at 08:23
  • @Kirill What's easier to read than a simple subtraction? In practice, I'd eschew `boost::lexical_cast` except to and from `std::string` (which I believe the `boost` people have optimized to only use a single `[io]stringstream`). It just doesn't seem appropriate. – James Kanze May 20 '11 at 08:42
  • @James Kanze, I totally agree with you that lexical_cast is overkill. I meant to say that I would use subtraction, but to avoid uncertainty I would change the iswdigit() to (peek >= L'0' && peek <= L'9') – Kirill Kovalenko May 20 '11 at 09:04
1

Despite the MSDN documentation, a simple test suggests that iswdigit returns true for more than just the range L'0'-L'9'.

for(wchar_t i = 0; i < 0xFFFF; ++i)
{
    if (iswdigit(i))
    {
        wprintf(L"%d : %lc\n", (int)i, (wint_t)i);
    }
}

That means that subtracting L'0' probably won't work as you may expect.
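
Since the result of a test like this may depend on the locale in effect, it can help to set the locale explicitly before counting. The following is a sketch under that assumption; the function name `countWideDigits` is made up, and the count you get for any locale other than "C" depends on your platform and C library.

```cpp
#include <clocale>
#include <cwctype>

// Hypothetical helper: counts the code points below 0xFFFF that iswdigit
// accepts after switching to the named locale.
// Returns -1 if the locale could not be set.
// Note that setlocale changes global program state.
int countWideDigits(const char* localeName)
{
    if (std::setlocale(LC_ALL, localeName) == nullptr) {
        return -1;
    }
    int count = 0;
    for (unsigned int i = 0; i < 0xFFFF; ++i) {
        if (std::iswdigit(static_cast<std::wint_t>(i))) {
            ++count;
        }
    }
    return count;
}
```

Comparing `countWideDigits("C")` with the count for a named locale shows whether the extra digits come from the locale or from the library itself.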

Kirill Kovalenko
  • In which locale? `iswdigit` is locale specific, so you can't make any statements about it without specifying the locale. – James Kanze May 20 '11 at 08:30
  • English or German. Can't say for sure. I have English box with some German settings. – Kirill Kovalenko May 20 '11 at 09:00
  • That doesn't necessarily affect your locale in the code. All programs start in "C" locale. – James Kanze May 20 '11 at 09:07
  • Are you sure that iswdigit depends on locale? MSDN says that: For iswdigit, the result of the test condition is independent of locale. – Kirill Kovalenko May 20 '11 at 09:12
  • I don't have my copy of the C standard here, but the Posix standard says "The iswdigit() function shall test whether wc is a wide-character code representing a character of class digit in the program's current locale;", and also says that "The functionality described on this reference page is aligned with the ISO C standard." This did sort of surprise me, because I remember distinctly that `isdigit` was the only narrow char `isxxx` function which was locale independent. (This may be a bug in the Posix standard, since it also says that `isdigit` is locale dependent.) – James Kanze May 20 '11 at 09:47
0

For most purposes you can just subtract the code for '0'.

However, the Wikipedia article on Unicode numerals mentions that the decimal digits are represented in 23 separate blocks (including twice in Arabic).

If you are not worried about that, then just subtract the code for '0'.

Ian Goldby
  • If those Unicode numerals are recognized by `iswdigit` then it could break my code. So I guess I have to worry about that :) – Lasse Espeholt May 20 '11 at 07:40
  • Unicode digits will break your code iff your current locale is one that doesn't use the ASCII/English standard digits. – Raze May 20 '11 at 08:16