I'm using NSXMLParser to parse XML in my app and I'm having a problem with the encoding type. For example, here is one of the feeds coming in. It looks similar to this:

\U2026Some random text from the xml feed\U2026

I am currently using the encoding type:

NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];

Which encoding type am I supposed to use to convert \U2026 into an ellipsis (…)?

Romes

3 Answers

The answer here is you're screwed. They are using a non-standard escape that XML does not define, but what if they really want the literal \U2026? Let's say you add a decoder to handle all \UXXXX and \uXXXX escapes. What happens when another feed wants the data to be the literal \U2026?

Your first choice and best bet is to get this feed fixed. If they need to encode data, they need to use proper numeric character references such as &#x2026; (named HTML entities like &hellip; are not predefined in plain XML).

As a fallback, I would isolate the decoder from the XML parser. Don't create a non-conforming XML parser just because you're getting non-conforming data. Have a post-processor that runs only on the offending feed.

If you must have a decoder, then there is more bad news. There is no built-in decoder; you will need to find a category online or write one yourself.

After some poking around, I think the question "Using Objective C/Cocoa to unescape unicode characters, ie \u1234" may work for you.
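
The post-processing idea above can be sketched in plain C, which also compiles inside an Objective-C file. This is only an illustration under stated assumptions: the names `unescape_backslash_u`, `postprocess_text`, and `struct feed` are hypothetical, the input is assumed to be NUL-terminated UTF-8, and only four-digit escapes are handled. Malformed escapes, \u0000, and surrogate values are copied through unchanged rather than guessed at.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Writes code point cp (1..0xFFFF, not a surrogate) as UTF-8 into out.
 * Returns the number of bytes written (1 to 3). */
static size_t put_utf8(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* True if s starts with four hex digits. Stops at the terminating NUL,
 * so it never reads past the end of the string. */
static int is_hex4(const char *s)
{
    for (int k = 0; k < 4; k++) {
        if (!isxdigit((unsigned char)s[k]))
            return 0;
    }
    return 1;
}

/* Returns a malloc'd copy of utf8 in which every \uXXXX or \UXXXX escape
 * is replaced by the UTF-8 bytes of that code point. Returns NULL if
 * allocation fails. The caller frees the result. */
char *unescape_backslash_u(const char *utf8)
{
    size_t len = strlen(utf8);
    size_t i = 0, o = 0;
    /* The output never grows: a 6-byte escape becomes at most 3 bytes. */
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    while (i < len) {
        if (utf8[i] == '\\' && (utf8[i + 1] == 'u' || utf8[i + 1] == 'U')
            && is_hex4(utf8 + i + 2)) {
            char digits[5];
            unsigned long cp;
            memcpy(digits, utf8 + i + 2, 4);
            digits[4] = '\0';
            cp = strtoul(digits, NULL, 16);
            if (cp != 0 && (cp < 0xD800 || cp > 0xDFFF)) {
                o += put_utf8(cp, out + o);
                i += 6; /* backslash, 'u', and four hex digits */
                continue;
            }
        }
        out[o++] = utf8[i++]; /* everything else is copied verbatim */
    }
    out[o] = '\0';
    return out;
}

/* Hypothetical per-feed setting: only feeds known to send backslash
 * escapes opt in, so conforming feeds keep a literal \U2026. */
struct feed {
    const char *name;
    int uses_backslash_u;
};

/* Runs after XML parsing, on the text the parser handed back. */
char *postprocess_text(const struct feed *feed, const char *text)
{
    size_t n;
    char *copy;
    if (feed->uses_backslash_u)
        return unescape_backslash_u(text);
    n = strlen(text) + 1;
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, text, n);
    return copy;
}
```

The XML parser stays conforming; the text it reports would be passed through `postprocess_text` afterwards, and the resulting UTF-8 bytes can be turned back into an NSString with `stringWithUTF8String:`.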

Jeffery Thomas

Alright, here's a snippet of code that should work for any Unicode code point written as four hex digits (\u0000 through \uFFFF):

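#import <Foundation/Foundation.h>

// returns the value of a hex digit, or -1 if c is not one
static int hexDigitValue(unichar c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

NSString *stringByUnescapingUnicodeSymbols(NSString *input)
{
    NSUInteger length = [input length];
    NSMutableString *output = [NSMutableString stringWithCapacity:length];
    NSUInteger i = 0;

    // walk the UTF-16 units rather than UTF-8 bytes, so that non-ASCII
    // text outside the escapes is copied through intact
    while (i < length) {
        unichar c = [input characterAtIndex:i];

        // an escape is '\', then 'u' or 'U', then exactly 4 hex digits
        if (c == '\\' && i + 6 <= length) {
            unichar marker = [input characterAtIndex:i + 1];
            unichar unicode = 0;
            BOOL valid = (marker == 'u' || marker == 'U');

            // remember that the escape digits are base 16
            for (NSUInteger j = i + 2; valid && j < i + 6; j++) {
                int digit = hexDigitValue([input characterAtIndex:j]);
                if (digit < 0)
                    valid = NO;
                else
                    unicode = (unichar)(unicode * 16 + digit);
            }

            if (valid) {
                [output appendString:[NSString stringWithCharacters:&unicode length:1]];
                i += 6; // skip '\', 'u' and the 4 hex digits, nothing more
                continue;
            }
        }

        // anything else (including a malformed escape) is copied as-is
        [output appendString:[NSString stringWithCharacters:&c length:1]];
        i++;
    }

    return output;
}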
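
One caveat about range: the function above reads exactly four hex digits, so a single escape can only name a value up to U+FFFF. A code point beyond that does not fit in one 16-bit `unichar` and has to be stored as a surrogate pair. The plain C sketch below shows that arithmetic; `encode_utf16` is a hypothetical helper name, not something from Foundation.

```c
#include <stdint.h>

/* Encodes code point cp as UTF-16 units in out.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value (above U+10FFFF or inside the surrogate range). */
int encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp; /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+2026 this yields the single unit 0x2026, while U+1F600 yields the pair 0xD83D 0xDE00. A feed that writes such a character as two consecutive four-digit escapes would still come out right, because the two units end up next to each other in the output string.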
Richard J. Ross III

You should simply replace the literal '\U2026' with the actual ellipsis character, then encode the string to NSData with NSUTF8StringEncoding.

Denis
  • Nope. Not the way to do it, not extendable by any means. If I was to come along and add `\u2010`, for example, your code would break yet again, and you would need to re-compile with a new rule for that specific instance. – Richard J. Ross III Jun 13 '12 at 21:31