iOS - XML to NSString conversion

Question

I'm using NSXMLParser to parse XML in my app and am having a problem with the encoding. For example, text from one of the incoming feeds looks similar to this:

\U2026Some random text from the xml feed\U2026

I am currently using the encoding type:

NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];

Which encoding am I supposed to use to convert \U2026 into an ellipsis (...)?

I thought \U2026 was "..." in Unicode, also known as an ellipsis. — Justin Paulson, Jun 13 '12 at 21:37
My bad, you are correct that that one is an ellipsis. However, I do have some quotation marks as well. — Romes, Jun 14 '12 at 13:48

score 1 · Answer 1 · edited May 23 '17 at 11:55

The answer here is that you're screwed. \U2026 is not an XML escape, so the feed is using a non-standard encoding on top of XML, and you can't tell it apart from text that really is meant to be the literal \U2026. Let's say you add a decoder to handle all \UXXXX and \uXXXX sequences. What happens when another feed wants the data to be the literal \U2026?
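To make the problem concrete, here is a small sketch in plain C (which also compiles inside an Objective-C file). It assumes the feed delivers the six ASCII characters \U2026 exactly as shown in the question; the names feed_text, ellipsis_utf8 and dump_bytes are made up for illustration. It shows why no choice of string encoding helps: encoding only serializes the characters you already have.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* What the feed contains (assumption): backslash, 'U', '2', '0', '2', '6'. */
const char feed_text[] = "\\U2026";

/* What the app wants: U+2026 HORIZONTAL ELLIPSIS encoded as UTF-8. */
const char ellipsis_utf8[] = "\xE2\x80\xA6";

/* Prints each byte of s in hex so the two strings can be compared. */
void dump_bytes(const char *label, const char *s)
{
    assert(label != NULL && s != NULL);

    size_t length = strlen(s);
    printf("%s:", label);
    for (size_t i = 0; i < length; i++) {
        printf(" %02X", (unsigned char)s[i]);
    }
    printf("\n");
}
```

Calling dump_bytes("feed", feed_text) prints "feed: 5C 55 32 30 32 36", while dump_bytes("want", ellipsis_utf8) prints "want: E2 80 A6". Something has to translate the first sequence into the second; picking a different NSStringEncoding will not do it.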

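Your first choice and best bet is to get this feed fixed. If they need to escape characters, they should use XML numeric character references such as &#x2026;, or simply emit the character itself in a properly encoded feed. HTML named entities such as &hellip; are not defined in plain XML unless the feed declares them.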

As a fallback, I would isolate the decoder from the XML parser. Don't create a non-conforming XML parser just because you're getting non-conforming data. Have a post-processor that runs only on the offending feed.

If you must have a decoder, then there is more bad news. There is no built-in decoder, so you will need to find a category online or write one yourself.
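A hand-written decoder that lives outside the XML parser might look roughly like the sketch below. It is plain C (so it also compiles in an Objective-C file) and works on a NUL-terminated UTF-8 buffer. All names, including postprocess_feed_text and the feed_uses_backslash_u flag, are hypothetical. It assumes escapes are always a backslash, u or U, and exactly four hex digits; anything else, including surrogate values and \u0000, is copied through untouched.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns the value of exactly four hex digits at p, or -1 if any is not hex. */
static long parse_hex4(const char *p)
{
    long value = 0;
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)p[i];
        if (!isxdigit(c)) {
            return -1; /* a terminating NUL also lands here, so no over-read */
        }
        value = value * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    return value;
}

/* Writes cp (U+0001..U+FFFF, no surrogates) as UTF-8; returns bytes written. */
static size_t put_utf8(long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Decodes \uXXXX and \UXXXX in place. In place is safe because a six-byte
 * escape always shrinks to at most three bytes of UTF-8. */
static void unescape_in_place(char *text)
{
    assert(text != NULL);
    if (strchr(text, '\\') == NULL) {
        return; /* nothing that could be an escape */
    }

    char *src = text;
    char *dst = text;
    while (*src != '\0') {
        long cp = -1;
        if (src[0] == '\\' && (src[1] == 'u' || src[1] == 'U')) {
            cp = parse_hex4(src + 2);
        }
        bool decodable = cp > 0 && !(cp >= 0xD800 && cp <= 0xDFFF);
        if (decodable) {
            dst += put_utf8(cp, dst);
            src += 6; /* backslash, u/U, four hex digits */
        } else {
            *dst++ = *src++; /* copy everything else through untouched */
        }
    }
    *dst = '\0';
}

/* Hypothetical per-feed hook: only the feed flagged as offending is rewritten. */
void postprocess_feed_text(char *text, bool feed_uses_backslash_u)
{
    if (feed_uses_backslash_u) {
        unescape_in_place(text);
    }
}
```

The flag keeps the workaround scoped to the one broken feed, so a well-behaved feed that really does contain the literal text \U2026 is left alone.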

After some poking around, I think the question "Using Objective C/Cocoa to unescape unicode characters, ie \u1234" may work for you.

score 1 · Answer 2 · answered Jun 13 '12 at 21:51

Alright, here's a snippet of code that should work for any four-hex-digit Unicode escape (\uXXXX or \UXXXX):

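#import <Foundation/Foundation.h>

// Returns the value of a single hex digit, or -1 if c is not a hex digit.
static int hexValue(unichar c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

NSString *stringByUnescapingUnicodeSymbols(NSString *input)
{
    NSUInteger length = [input length];
    NSMutableString *output = [NSMutableString stringWithCapacity:length];

    // Walk UTF-16 code units rather than UTF-8 bytes, so non-ASCII
    // characters already present in the string are copied through intact.
    NSUInteger i = 0;
    while (i < length) {
        unichar c = [input characterAtIndex:i];
        unichar decoded = 0;
        BOOL isEscape = NO;

        // An escape is '\', then 'u' or 'U', then exactly four hex digits.
        if (c == '\\' && i + 6 <= length) {
            unichar marker = [input characterAtIndex:i + 1];
            isEscape = (marker == 'u' || marker == 'U');
            for (NSUInteger j = 2; isEscape && j < 6; j++) {
                int digit = hexValue([input characterAtIndex:i + j]);
                if (digit < 0) {
                    isEscape = NO; // malformed: leave the text as it is
                } else {
                    decoded = (unichar)(decoded * 16 + digit); // hex is base 16
                }
            }
        }

        if (isEscape) {
            [output appendFormat:@"%C", decoded];
            i += 6; // skip '\', 'u' and the four hex digits, nothing more
        } else {
            [output appendFormat:@"%C", c];
            i += 1;
        }
    }

    return output;
}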

score 0 · Answer 3 · answered Jun 13 '12 at 21:28

You should simply replace the literal '\U2026' with a quotation mark, then encode the result to NSData with NSUTF8StringEncoding.

Denis

Nope. That's not the way to do it; it isn't extensible by any means. If I were to come along and add `\u2010`, for example, your code would break yet again, and you would need to recompile with a new rule for that specific case. – Richard J. Ross III, Jun 13 '12 at 21:31
