2

I'm in the process of porting an Android app to iOS and I've hit a small roadblock. I'm pulling HTML-encoded data from a webpage, but some of the data is presented in Unicode to display foreign characters... so characters in Russian (Лети за мной) will be parsed out as "&#1051;&#1077;&#1090;..."

On Android I was able to get around this by calling Html.fromHtml(). Is there anything similar on iOS?

Scott
  • What's the problem here? UTF-8 is extremely common these days. You didn't give any details on what you're using for HTML parsing, or really what your issue is. – Lily Ballard Sep 29 '11 at 19:58
  • Ah, you updated. I take it you mean the data is encoded with HTML entities, but does not, in fact, include HTML tags? – Lily Ballard Sep 29 '11 at 20:05

3 Answers

6

It's pretty easy to write your own HTML entity decoder. Just scan the string looking for &, read up to the following ;, then interpret the results. If it's "amp", "lt", "gt", or "quot", replace it with the relevant character. If it starts with #, it's a numeric entity. If the # is followed by an "x", treat the rest as hexadecimal, otherwise as decimal. Read the number, and then insert the character into your string (if you're writing to an NSMutableString you can use [str appendFormat:@"%C", thechar]). NSScanner can make the string scanning pretty easy, especially since it already knows how to read hex numbers.

I just whipped up a function that should do this for you. Note, I haven't actually tested this, so you should run it through its paces:

- (NSString *)stringByDecodingHTMLEntitiesInString:(NSString *)input {
    NSMutableString *results = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];
    while (![scanner isAtEnd]) {
        NSString *temp;
        if ([scanner scanUpToString:@"&" intoString:&temp]) {
            [results appendString:temp];
        }
        if ([scanner scanString:@"&" intoString:NULL]) {
            BOOL valid = YES;
            unsigned c = 0;
            NSUInteger savedLocation = [scanner scanLocation];
            if ([scanner scanString:@"#" intoString:NULL]) {
                // it's a numeric entity
                if ([scanner scanString:@"x" intoString:NULL]) {
                    // hexadecimal
                    unsigned int value;
                    if ([scanner scanHexInt:&value]) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                } else {
                    // decimal
                    int value;
                    if ([scanner scanInt:&value] && value >= 0) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                }
                if (![scanner scanString:@";" intoString:NULL]) {
                    // not ;-terminated, bail out and emit the whole entity
                    valid = NO;
                }
            } else {
                if (![scanner scanUpToString:@";" intoString:&temp]) {
                    // &; is not a valid entity
                    valid = NO;
                } else if (![scanner scanString:@";" intoString:NULL]) {
                    // there was no trailing ;
                    valid = NO;
                } else if ([temp isEqualToString:@"amp"]) {
                    c = '&';
                } else if ([temp isEqualToString:@"quot"]) {
                    c = '"';
                } else if ([temp isEqualToString:@"lt"]) {
                    c = '<';
                } else if ([temp isEqualToString:@"gt"]) {
                    c = '>';
                } else {
                    // unknown entity
                    valid = NO;
                }
            }
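            if (!valid || c > 0x10FFFF) {
                // we errored, just emit the whole thing raw, including the
                // leading & (savedLocation points just past it)
                [results appendFormat:@"&%@", [input substringWithRange:NSMakeRange(savedLocation, [scanner scanLocation]-savedLocation)]];
            } else if (c > 0xFFFF) {
                // %C takes a 16-bit unichar, so emit a UTF-16 surrogate pair
                c -= 0x10000;
                [results appendFormat:@"%C%C", (unichar)(0xD800 + (c >> 10)), (unichar)(0xDC00 + (c & 0x3FF))];
            } else {
                [results appendFormat:@"%C", (unichar)c];
            }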
        }
    }
    return results;
}
Lily Ballard
  • Your code failed and I didn't feel like trying to fix it :P Instead I looked into your method and apparently Google released the "Google Toolbox for Mac" (http://code.google.com/p/google-toolbox-for-mac/) which contains a bunch of useful functions that do what I'm trying to achieve. Thanks for pointing me in the right direction. – Scott Sep 29 '11 at 21:33
  • @Scott: Sorry about that. I just tried compiling the thing, and fixed the errors. The version I have posted now should work, though I haven't tested it exhaustively. – Lily Ballard Sep 29 '11 at 21:44
  • Care: there are a lot more HTML entities that might be used than just `&"<>`! – bobince Sep 29 '11 at 23:20
  • @bobince: Sure, and a "real" HTML parser would take care to deal with all of them. But most data that's entity-encoded uses numeric entities, the named ones tend to only show up in human-created HTML. – Lily Ballard Sep 29 '11 at 23:31
  • @Kevin: Unfortunately a lot of PHP authors seem to love using `htmlentities()` instead of the generally-more-appropriate `htmlspecialchars()`, resulting in loads of unnecessary entity references. :-( – bobince Sep 29 '11 at 23:38
  • @bobince: Here's a [complete list](http://www.w3.org/TR/html5/named-character-references.html) of all named entities in HTML5. It's a *huge* table. You're free to encode that as a table used by this function, but it's not something I'd care to do. – Lily Ballard Sep 29 '11 at 23:55
2

The &#(number); construct in HTML (and XML) is known as a character reference. It's not Unicode-specific, other than in that all characters in HTML are defined in terms of Unicode, whether included verbatim or encoded as a character or entity reference. (Entity references are the named ones that look like &eacute; or &amp; and if you are scraping an HTML page you will certainly have to deal with those as well.)

There isn't a function in the standard library for decoding character or entity references. See this question for approaches to decoding HTML text content. If you only have character references and the standard XML entities like &amp; you can get away with leveraging NSXMLParser to parse an <element>+yourstring+</element>, but this won't handle HTML-specific entities like &eacute;.
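The NSXMLParser trick could look roughly like the sketch below. The `EntityDecoder` class and the `DecodeXMLEntities` function are hypothetical names made up for illustration, and the code assumes ARC. The input is wrapped in a dummy element, parsed, and the character data reported to the delegate is collected; an HTML-only entity such as &eacute; (or a bare &) is expected to make the parse fail, in which case the function returns nil.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: accumulates the character data NSXMLParser reports.
@interface EntityDecoder : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation EntityDecoder
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // May be called several times for one text run, so always append.
    [self.text appendString:string];
}
@end

// Hypothetical function: decodes character references and the standard
// XML entities (&amp; &lt; &gt; &quot; &apos;) in `input`.
static NSString *DecodeXMLEntities(NSString *input) {
    NSString *wrapped = [NSString stringWithFormat:@"<element>%@</element>", input];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    EntityDecoder *decoder = [[EntityDecoder alloc] init];
    decoder.text = [NSMutableString string];
    parser.delegate = decoder;
    // -parse returns NO if the wrapped string is not well-formed XML.
    return [parser parse] ? [decoder.text copy] : nil;
}
```

Note that any markup inside the input is treated as XML too, so this only suits strings that contain text and references, not arbitrary HTML.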

In general, screen-scraping is best done using a proper HTML parser, rather than string-hacking. This will convert all text content into text nodes, converting the character and entity references as it goes. However, again, there is no HTML parser available in the standard library. If the target page is well-formed standalone XHTML you can again use NSXMLParser. Otherwise you might like to try libxml2, which offers an HTML parser as well as XML. See this question for some background.
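As a rough sketch of the libxml2 route, the function below (the name `TextContentOfHTML` is made up for illustration) parses an HTML fragment with libxml2's HTML parser and returns the concatenated text content, with character and entity references already resolved. It assumes the project links against libxml2 (`-lxml2`) and has the libxml2 headers on its include path, and it assumes the input can be treated as UTF-8.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical function: returns the text content of an HTML string,
// or nil if libxml2 could not build a document from it.
static NSString *TextContentOfHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }
    NSString *result = nil;
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL) {
        // xmlNodeGetContent concatenates all descendant text nodes;
        // the returned buffer must be released with xmlFree.
        xmlChar *content = xmlNodeGetContent(root);
        if (content != NULL) {
            result = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return result;
}
```

For real scraping you would normally walk the tree or use XPath to pick out specific nodes rather than flattening the whole document to text.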

bobince
0

If you get data from a website, you will have an NS(Mutable)Data object as your receive buffer. You just have to convert that NSData into an NSString via:
NSString *myString = [[NSString alloc] initWithData:myRecvData encoding:NSUnicodeStringEncoding];
if your server is sending UTF-16 (NSUnicodeStringEncoding is an alias for NSUTF16StringEncoding). If your server is sending UTF-8 or another encoding, you have to adjust the string encoding in your receiving code accordingly, for example to NSUTF8StringEncoding.

Here is a list of all supported string encoding types.

Edit: take a look at this SO thread.

thomas