
I am trying to use the ICU library to test whether a string contains invalid UTF-8. I created a UTF-8 converter, but the conversion never reports an error, even when I feed it invalid data. I'd appreciate your help.

Thanks, Prashanth

#include <unicode/ucnv.h>
#include <cassert>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    string str ("AP1120 CorNet-IP v5.0 v5.0.1.22 òÀ MIB 1.5.3.50 Profile EN-C5000");
    //  string str ("example string here");
    //  string str (" ����������"     );
    UErrorCode status = U_ZERO_ERROR;
    const char *source = str.c_str();
    UConverter *cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    int sourceLength = str.length();
    int targetLimit = 2 * sourceLength;
    UChar *target = new UChar[targetLimit];

    ucnv_toUChars(cnv, target, targetLimit, source, sourceLength, &status);
    cout << u_errorName(status) << endl;
    assert(U_SUCCESS(status));

    // Release the buffer and the converter.
    delete[] target;
    ucnv_close(cnv);
}
informatik01
user1245457
  • I'm not familiar with this library, but it seems to me that if you open your converter with `"utf-8"` and then call `ucnv_toUChars` to convert, aren't you more or less telling it to convert from Unicode to Unicode? It may short-circuit with success in this case. I'd try opening it with an ISO encoding or something else. – AJG85 Mar 02 '12 at 20:14

2 Answers

7

I modified your program to print out the actual strings, before and after:

#include <unicode/ucnv.h>
#include <string>
#include <iostream>
#include <cassert>
#include <cstdio>

int main()
{
    std::string str("22 òÀ MIB 1");
    UErrorCode status = U_ZERO_ERROR;
    UConverter * const cnv = ucnv_open("utf-8", &status);
    assert(U_SUCCESS(status));

    const int32_t targetLimit = 2 * static_cast<int32_t>(str.size());
    UChar *target = new UChar[targetLimit];

    // A source length of -1 tells ICU that the input is NUL-terminated.
    ucnv_toUChars(cnv, target, targetLimit, str.c_str(), -1, &status);

    // First line: the converted UTF-16 code units.
    for (int32_t i = 0; i != targetLimit && target[i] != 0; ++i)
        std::printf("0x%04X ", static_cast<unsigned int>(target[i]));
    std::cout << std::endl;
    // Second line: the raw bytes of the string literal.
    for (char c : str)
        std::printf("0x%02X ", static_cast<unsigned char>(c));
    std::cout << std::endl << "Status: " << status << std::endl;

    // Release the buffer and the converter.
    delete[] target;
    ucnv_close(cnv);
}

Now, with default compiler settings, I get:

0x0032 0x0032 0x0020 0x00F2 0x00C0 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031
0x32 0x32 0x20 0xC3 0xB2 0xC3 0x80 0x20 0x4D 0x49 0x42 0x20 0x31

That is, the input is already UTF-8. This is a conspiracy of my editor, which saved the file in UTF-8 (verifiable in a hex editor), and of GCC, which sets its execution character set to UTF-8.

You can coerce GCC to change those parameters. For example, forcing the execution character set to ISO-8859-1 (via -fexec-charset=iso-8859-1) produces this:

0x0032 0x0032 0x0020 0xFFFD 0xFFFD 0x0020 0x004D 0x0049 0x0042 0x0020 0x0031
0x32 0x32 0x20 0xF2 0xC0 0x20 0x4D 0x49 0x42 0x20 0x31

As you can see, the input is now ISO-8859-1-encoded, and the conversion promptly fails on the two non-ASCII bytes and produces "invalid character" code points U+FFFD in their place.

However, the conversion operation still returns a "success" state. It appears that the library doesn't consider a user data conversion error an error of the function call. Rather, the error status seems to be reserved for things like running out of space.

Kerrek SB
  • Interesting, my guess was somewhat close then. +1 for experimenting. I was about to come back with a post saying `ucnv_getInvalidUChars` may be more what the OP wants but it may be better in your answer, if applicable. – AJG85 Mar 02 '12 at 21:05
  • Thanks for your answer, it now makes sense why conversion wasn't failing. For testing purposes if I want to continue to use the default character set of gcc, is it possible to save the input in such a way that it is saved in its original form and not in the UTF-8 form? – user1245457 Mar 05 '12 at 21:51
  • @user1245457: There's no input in the example, only hardcoded data in the source code. Nothing happens to the actual *input*, which is just an opaque byte stream and which you can save at will. – Kerrek SB Mar 05 '12 at 21:57
  • @KerrekSB - if I continue to use the default character set of gcc, how do I detect a non-UTF-8 input? Regardless of the input string, if the default character set is used, all input strings are flagged as valid. – user1245457 Mar 06 '12 at 16:27
  • You don't. Rather, you write in your manual, "the input must be encoded in UTF-8". Input is just a sequence of dumb numbers, and if you don't agree on what they mean, you can't just guess. The better solution would be to use the environment's locale setting (see ``), and do the whole `mbsrtowcs`- followed by iconv-WCHAR-to-UTF8 conversion [that I keep going on about](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability)... – Kerrek SB Mar 06 '12 at 18:21
  • @KerrekSB, I just ran your example above for the Chinese character: 斯 and this is my output: first line: 0x65AF second line: 0xE6 0x96 0xAF It looks like the second line is UTF-8 but what is the first? Is it UTF-16? Thank you! – Caroline Beltran Oct 16 '14 at 18:59
  • @CarolineBeltran: Wouldn't that depend on your source and execution character sets, and your editor? – Kerrek SB Oct 16 '14 at 19:01
  • I'm Using VC++ 2013 and saving the source file using Notepad++ in UTF-8 format. I'm just confused as to the format being used by VC++ 2013 when encountering std::string str("斯"); – Caroline Beltran Oct 16 '14 at 19:08
  • @CarolineBeltran: You need to ask your editor, or look at your source file in a hex editor. I can't possibly know :-( – Kerrek SB Oct 16 '14 at 19:10
  • @KerrekSB, I found a site that converts strings to both UTF-8/16 and converted the Chinese character that I mentioned above and got the same results produced by your code snippet. This was really helpful to me because I can see that even though I hardcode a character into VC++ 2013 and save the source file as UTF-8, VC++ will still treat the hardcoded strings as UTF-16. I've never used a Linux compiler but from what I've read, I'd think that a Linux user would never have this issue. – Caroline Beltran Oct 16 '14 at 19:28
0

I use the code below. It detects all candidate charsets for the string and checks their names one by one against "UTF-8". If one matches, the string is treated as valid UTF-8 and the loop stops. The function returns false if no charsets are detected or if none of the charset names equals "UTF-8".

References

ICU documentation: CharsetDetector

Code

#include <cstdlib>
#include <iostream>
#include <string>
#include <unicode/ucsdet.h>

#define UTF8_CHARSET_NAME_STRING ("UTF-8"s)

using namespace std::string_literals;

bool IsValidUTF8(const std::string &data)
{
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector *detector = ucsdet_open(&status);
    ucsdet_setText(detector, data.c_str(), static_cast<int32_t>(data.length()), &status);
    int32_t detectedNumber = 0;
    auto matches = ucsdet_detectAll(detector, &detectedNumber, &status);
    bool valid = false;
    if (U_SUCCESS(status) && matches)
    {
        // Scan every candidate charset and stop at the first "UTF-8" match.
        for (int32_t i = 0; i < detectedNumber; i++)
        {
            const char *charsetName = ucsdet_getName(matches[i], &status);
            if (charsetName && UTF8_CHARSET_NAME_STRING == charsetName)
            {
                valid = true;
                break;
            }
        }
    }
    ucsdet_close(detector); // always release the detector, even on failure
    return valid;
}

int main()
{
    std::string strData = {(char)0xff, 0x25, 0x00, (char)0xfa, (char)0xff, (char)0xff, (char)0xff};
    std::cout << "Result: " << (IsValidUTF8(strData) ? ("true; Original String : \"" + strData + "\"") : "false") << std::endl;

    strData = "HelloWorld!!!";
    std::cout << "Result: " << (IsValidUTF8(strData) ? ("true; Original String : \"" + strData + "\"") : "false") << std::endl;
    return EXIT_SUCCESS; // 0
}

Output


Result: false
Result: true; Original String : "HelloWorld!!!"
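
Note that a charset detector makes a statistical guess about the encoding rather than checking well-formedness byte by byte. As a complement, here is a minimal sketch of a strict check that does not use ICU at all. It follows the well-formed byte ranges for UTF-8 given in the Unicode Standard; the function name `isWellFormedUtf8` is hypothetical, and this is an illustration rather than a replacement for a tested library routine:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strict structural check that `s` is well-formed UTF-8
// (rejects overlong forms, surrogates, and code points above U+10FFFF).
bool isWellFormedUtf8(const std::string &s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;      // total bytes in this sequence
        unsigned char lo = 0x80;  // allowed range for the second byte
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) { ++i; continue; }            // ASCII
        else if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;   // C0 and C1 would be overlong
        else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;                // no overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;                // no UTF-16 surrogates
        }
        else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;                // no overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;                // nothing above U+10FFFF
        }
        else return false;                            // stray trail byte or F5..FF

        if (n - i < len) return false;                // truncated sequence

        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {       // remaining trail bytes
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

Called on the two test strings above, this rejects the first one at its leading `0xFF` byte and accepts `"HelloWorld!!!"`, which is pure ASCII.
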
Joma