103

First of all, I found this: Objective C HTML escape/unescape, but it doesn't work for me.

My encoded characters (they come from an RSS feed, btw) look like this: &#038;

I searched all over the net and found related discussions, but no fix for my particular encoding. I think they are called hexadecimal characters.

treznik
  • 3
    This comment is six months after the original question, so it's more for those that stumble across this question looking for an answer and a solution. A very similar question came up just recently that I answered http://stackoverflow.com/questions/2254862/special-characters-in-nsstring-from-html/2260140#2260140 It uses RegexKitLite and Blocks to do a search and replace of the `&#...;` in a string with its equivalent character. – johne Feb 16 '10 at 05:48
  • What specifically “doesn't work”? I don't see anything in this question that isn't a duplicate of that earlier question. – Peter Hosey Mar 03 '10 at 14:45
  • It's decimal. Hexadecimal is `&#x38;`. – kennytm Mar 03 '10 at 14:46
  • The difference between decimal and hexadecimal being that decimal is base-10, whereas hexadecimal is base-16. “38” is a different number in each base; in base 10, it's 3×10 + 8×1 = thirty-eight, whereas in base-16, it's 3×16 + 8×1 = fifty-six. Higher digits are (multiples of) higher powers of the base; the lowest whole digit is base**0 (= 1), the next higher digit is base**1 (= base), the next one is base**2 (= base * base), etc. This is exponentiation at work. – Peter Hosey Mar 03 '10 at 17:34
  • http://stackoverflow.com/questions/25607247/how-do-i-decode-html-entities-in-swift – Sanju Mar 23 '17 at 06:15
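To make the arithmetic in the comments above concrete, here is a minimal sketch in Swift (chosen because later answers on this page use Swift). The variable names are illustrative and do not come from the original comments. It reads the same digits "38" in both bases and shows which character each value names:

```swift
// The same digits "38" read in base 10 and in base 16.
let asDecimal = UInt32("38", radix: 10)!   // 3×10 + 8×1
let asHex = UInt32("38", radix: 16)!       // 3×16 + 8×1

// The character each value names, i.e. what &#38; and &#x38; decode to.
let decimalCharacter = Character(Unicode.Scalar(asDecimal)!)
let hexCharacter = Character(Unicode.Scalar(asHex)!)

print(asDecimal, decimalCharacter)  // 38 &
print(asHex, hexCharacter)          // 56 8
```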

13 Answers

164

Check out my NSString category for HTML. Here are the methods available:

- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;
Michael Waterfall
53

The one by Daniel is basically very nice, and I fixed a few issues there:

  1. removed the skipped characters for NSScanner (otherwise spaces between two consecutive entities would be ignored):

    [scanner setCharactersToBeSkipped:nil];

  2. fixed the parsing when there are isolated '&' symbols (I am not sure what the 'correct' output for this is; I just compared it against Firefox):

e.g.

    &#ABC DF & B&#39;  & C&#39; Items (288)

here is the modified code:

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];

    [scanner setCharactersToBeSkipped:nil];

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];

    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }

            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];

                [scanner scanString:@";" intoString:NULL];
            }
            else {
                NSString *unknownEntity = @"";

                [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];


                [result appendFormat:@"&#%@%@", xForHex, unknownEntity];

                //[scanner scanUpToString:@";" intoString:&unknownEntity];
                //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);

            }

        }
        else {
            NSString *amp;

            [scanner scanString:@"&" intoString:&amp];  //an isolated & symbol
            [result appendString:amp];

            /*
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
             */
        }

    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
Ant
Walty Yeung
49

As of iOS 7, you can decode HTML characters natively by using an NSAttributedString with the NSHTMLTextDocumentType attribute:

NSString *htmlString = @"&#63743; &amp; &#38; &lt; &gt; &trade; &copy; &hearts; &clubs; &spades; &diams;";
NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];

NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType};
NSAttributedString *decodedString;
decodedString = [[NSAttributedString alloc] initWithData:stringData
                                                 options:options
                                      documentAttributes:NULL
                                                   error:NULL];

The decoded attributed string will now be displayed as:  & & < > ™ © ♥ ♣ ♠ ♦.

Note: This will only work if called on the main thread.

Bryan Luby
  • 7
    best answer if you don't need to support iOS 6 and older – jcesarmobile May 22 '14 at 10:59
  • 1
    no, not the best if someone wants to encode it on bg thread ;O – badeleux Oct 20 '14 at 08:11
  • 4
    This worked for decoding an entity, but it also messed up a non-encoded dash. – Andrew Dec 09 '14 at 21:31
  • This is forced to happen on the main thread. So you probably don't want to do this if you don't have to. – Keith Smiley Jan 07 '15 at 04:02
  • It just hangs the GUI when it's matter of UITableView. Hence, not working correctly. – Asif Bilal Apr 01 '15 at 16:05
  • Great solution. Unfortunately, non-encoded dashes cause problems. This is what I did prior to the decoding code: `string = [string stringByReplacingOccurrencesOfString:@"–" withString:@"&ndash;"]; string = [string stringByReplacingOccurrencesOfString:@"—" withString:@"&mdash;"];` Please let me know if any other characters cause problems as well? – Mike Keskinov Oct 31 '16 at 15:14
  • nice concise solution. when i return decodedString i return the [decodedString string] instead to return NSString instead of NSAttributedString – debonaire Jan 14 '20 at 09:25
  • Nice solution, note though that it is pretty _invasive_, i.e. it reduces multiple consecutive spaces into one. That might not be intended if all you want to do is decoding HTML entities. – DrMickeyLauer Apr 21 '22 at 12:32
46

Those are called Character Entity References. When they take the form of &#<number>; they are called numeric entity references. Basically, it's a string representation of the byte that should be substituted. In the case of &#038;, it represents the character with the value of 38 in the ISO-8859-1 character encoding scheme, which is &.

The reason the ampersand has to be encoded in RSS is it's a reserved special character.

What you need to do is parse the string and replace the entities with a byte matching the value between &# and ;. I don't know of any great ways to do this in Objective-C, but this Stack Overflow question might be of some help.

Edit: Since answering this some two years ago there are some great solutions; see @Michael Waterfall's answer below.
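The parse-and-replace approach described above can be sketched roughly as follows. This is a minimal illustration in Swift using only the standard library, not the method used by any of the other answers. `decodingNumericReferences` is a hypothetical helper name, and the sketch assumes that only numeric references (decimal `&#038;` or hexadecimal `&#x26;`) need decoding; named entities such as `&amp;` are left untouched.

```swift
/// Decodes numeric character references such as "&#038;" (decimal) and
/// "&#x26;" (hexadecimal). Hypothetical helper, for illustration only.
func decodingNumericReferences(_ input: String) -> String {
    let chars = Array(input)
    var result = ""
    var i = 0

    while i < chars.count {
        // Anything that does not start with "&#" is copied through unchanged.
        guard chars[i] == "&", i + 1 < chars.count, chars[i + 1] == "#" else {
            result.append(chars[i])
            i += 1
            continue
        }

        // An "x" or "X" after "&#" selects base 16; otherwise the digits are base 10.
        var j = i + 2
        var radix = 10
        if j < chars.count, chars[j] == "x" || chars[j] == "X" {
            radix = 16
            j += 1
        }

        // Collect candidate digits; the parse below rejects digits invalid for the radix.
        let digitsStart = j
        while j < chars.count, chars[j].isHexDigit { j += 1 }
        let digits = String(chars[digitsStart..<j])

        if j < chars.count, chars[j] == ";",
           let value = UInt32(digits, radix: radix),
           let scalar = Unicode.Scalar(value) {
            // Well-formed reference: emit the character it names and skip past ";".
            result.unicodeScalars.append(scalar)
            i = j + 1
        } else {
            // Malformed reference: keep the "&" literally and move on.
            result.append(chars[i])
            i += 1
        }
    }
    return result
}

print(decodingNumericReferences("Fish &#038; Chips"))  // Fish & Chips
```

Malformed input such as `&#ABC` is passed through unchanged rather than guessed at, which is one reasonable design choice for feed data that may contain stray ampersands.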

Matt Bridges
  • 2
    +1 I was just about to submit the exact same answer (including the same links, no less!) – e.James Jul 09 '09 at 17:17
  • “Basically, it's a string representation of the byte that should be substituted.” More like character. This is text, not data; upon converting the text to data, the character may occupy multiple bytes, depending on the character and the encoding. – Peter Hosey Jul 09 '09 at 18:17
  • Thanks for the reply. You said "it represents the character with the value of 38 in the ISO-8859-1 character encoding scheme, which is &". Are you sure about that? Do you have a link to a character table of this type? Because from what I recall that was a single quote. – treznik Jul 11 '09 at 19:59
  • http://en.wikipedia.org/wiki/ISO/IEC_8859-1#ISO-8859-1 or just type &#38; into google. – Matt Bridges Jul 12 '09 at 11:39
  • and what about &amp; or &copy; symbols? – vokilam Apr 23 '13 at 07:57
35

Nobody seems to mention one of the simplest options: Google Toolbox for Mac
(Despite the name, this works on iOS too.)

https://github.com/google/google-toolbox-for-mac/blob/master/Foundation/GTMNSString%2BHTML.h

/// Get a string where internal characters that are escaped for HTML are unescaped 
//
///  For example, '&amp;' becomes '&'
///  Handles &#32; and &#x32; cases as well
///
//  Returns:
//    Autoreleased NSString
//
- (NSString *)gtm_stringByUnescapingFromHTML;

And I had to include only three files in the project: header, implementation and GTMDefines.h.

Raptor
Nikita Rybak
18

I ought to post this on GitHub or something. This goes in a category of NSString, uses NSScanner for the implementation, and handles both hex and decimal numeric character entities as well as the usual symbolic ones.

Also, it handles malformed strings (when you have an & followed by an invalid sequence of characters) relatively gracefully, which turned out to be crucial in my released app that uses this code.

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];
    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }
            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];
            }
            else {
                NSString *unknownEntity = @"";
                [scanner scanUpToString:@";" intoString:&unknownEntity];
                [result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
            }
            [scanner scanString:@";" intoString:NULL];
        }
        else {
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
        }
    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
Quinn Taylor
Daniel Dickison
4

This is the way I do it using RegexKitLite framework:

- (NSString *)decodeHtmlUnicodeCharacters:(NSString *)html {
    NSString *result = html;
    // Each match is an array: index 0 is the whole reference (e.g. &#38;), index 1 is the digits.
    NSArray *matches = [result arrayOfCaptureComponentsMatchedByRegex:@"\\&#([\\d]+);"];

    if (![matches count])
        return result;

    for (NSUInteger i = 0; i < [matches count]; i++) {
        NSArray *array = [matches objectAtIndex:i];
        NSString *charCode = [array objectAtIndex:1];
        int code = [charCode intValue];
        // %C expects a unichar, so cast the parsed decimal value explicitly.
        NSString *character = [NSString stringWithFormat:@"%C", (unichar)code];
        result = [result stringByReplacingOccurrencesOfString:[array objectAtIndex:0]
                                                   withString:character];
    }
    return result;
}

Hope this will help someone.

realsugar
4

You can use just this function to solve this problem.

+ (NSString*) decodeHtmlUnicodeCharactersToString:(NSString*)str
{
    // Replaces decimal references such as &#39; with the character they name (').
    NSMutableString* string = [NSMutableString stringWithString:str];
    NSUInteger i = 0;

    while (i + 2 < [string length])
    {
        // A reference starts with the adjacent pair "&#".
        if ([string characterAtIndex:i] != '&' || [string characterAtIndex:i + 1] != '#')
        {
            ++i;
            continue;
        }

        // Read the run of decimal digits, i.e. 39.
        NSUInteger k = i + 2;
        unsigned int code = 0;
        while (k < [string length])
        {
            unichar digit = [string characterAtIndex:k];
            if (digit < '0' || digit > '9' || code > 0xFFFF)
                break;
            code = code * 10 + (digit - '0');
            ++k;
        }

        // Replace only a well-formed reference: at least one digit followed by ';'.
        if (k > i + 2 && k < [string length] && [string characterAtIndex:k] == ';' && code <= 0xFFFF)
        {
            NSString* replaceStr = [NSString stringWithFormat:@"%C", (unichar)code];
            [string replaceCharactersInRange:NSMakeRange(i, k - i + 1) withString:replaceStr];
        }
        ++i;
    }
    return string;
}
Scott Evernden
3

Here's a Swift version of Walty Yeung's answer:

extension String {
    static private let mappings = [
        "&quot;" : "\"", "&amp;" : "&", "&lt;" : "<", "&gt;" : ">", "&nbsp;" : "\u{00A0}",
        "&iexcl;" : "¡", "&cent;" : "¢", "&pound;" : "£", "&curren;" : "¤", "&yen;" : "¥",
        "&brvbar;" : "¦", "&sect;" : "§", "&uml;" : "¨", "&copy;" : "©", "&ordf;" : "ª",
        "&laquo;" : "«", "&not;" : "¬", "&reg;" : "®", "&macr;" : "¯", "&deg;" : "°",
        "&plusmn;" : "±", "&sup2;" : "²", "&sup3;" : "³", "&acute;" : "´", "&micro;" : "µ",
        "&para;" : "¶", "&middot;" : "·", "&cedil;" : "¸", "&sup1;" : "¹", "&ordm;" : "º",
        "&raquo;" : "»", "&frac14;" : "¼", "&frac12;" : "½", "&frac34;" : "¾", "&iquest;" : "¿",
        "&times;" : "×", "&divide;" : "÷", "&ETH;" : "Ð", "&eth;" : "ð", "&THORN;" : "Þ",
        "&thorn;" : "þ", "&AElig;" : "Æ", "&aelig;" : "æ", "&OElig;" : "Œ", "&oelig;" : "œ",
        "&Aring;" : "Å", "&Oslash;" : "Ø", "&Ccedil;" : "Ç", "&ccedil;" : "ç", "&szlig;" : "ß",
        "&Ntilde;" : "Ñ", "&ntilde;" : "ñ"
    ]

    func stringByDecodingXMLEntities() -> String {

        guard let _ = self.rangeOfString("&", options: [.LiteralSearch]) else {
            return self
        }

        var result = ""

        let scanner = NSScanner(string: self)
        scanner.charactersToBeSkipped = nil

        let boundaryCharacterSet = NSCharacterSet(charactersInString: " \t\n\r;")

        repeat {
            var nonEntityString: NSString? = nil

            if scanner.scanUpToString("&", intoString: &nonEntityString) {
                if let s = nonEntityString as? String {
                    result.appendContentsOf(s)
                }
            }

            if scanner.atEnd {
                break
            }

            var didBreak = false
            for (k,v) in String.mappings {
                if scanner.scanString(k, intoString: nil) {
                    result.appendContentsOf(v)
                    didBreak = true
                    break
                }
            }

            if !didBreak {

                if scanner.scanString("&#", intoString: nil) {

                    var gotNumber = false
                    var charCodeUInt: UInt32 = 0
                    var charCodeInt: Int32 = -1
                    var xForHex: NSString? = nil

                    if scanner.scanString("x", intoString: &xForHex) {
                        gotNumber = scanner.scanHexInt(&charCodeUInt)
                    }
                    else {
                        gotNumber = scanner.scanInt(&charCodeInt)
                    }

                    if gotNumber {
                        let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt)
                        result.appendContentsOf(newChar)
                        scanner.scanString(";", intoString: nil)
                    }
                    else {
                        var unknownEntity: NSString? = nil
                        scanner.scanUpToCharactersFromSet(boundaryCharacterSet, intoString: &unknownEntity)
                        let h = xForHex ?? ""
                        let u = unknownEntity ?? ""
                        result.appendContentsOf("&#\(h)\(u)")
                    }
                }
                else {
                    scanner.scanString("&", intoString: nil)
                    result.appendContentsOf("&")
                }
            }

        } while (!scanner.atEnd)

        return result
    }
}
Max Chuquimia
1

Actually, the great MWFeedParser framework by Michael Waterfall (referred to in his answer) has been forked by rmchaara, who has updated it with ARC support!

You can find it on GitHub here

It works really well; I used the stringByDecodingHTMLEntities method and it works flawlessly.

angelos.p
0

As if you need another solution! This one is pretty simple and quite effective:

@interface NSString (NSStringCategory)
- (NSString *) stringByReplacingISO8859Codes;
@end


@implementation NSString (NSStringCategory)
- (NSString *) stringByReplacingISO8859Codes
{
    NSMutableString *dataString = [NSMutableString stringWithString: self];
    NSUInteger searchStart = 0;
    do {
        //*** See if the rest of the string contains the &# prefix
        NSRange searchRange = NSMakeRange(searchStart, [dataString length] - searchStart);
        NSRange range = [dataString rangeOfString: @"&#" options: NSLiteralSearch range: searchRange];
        if (range.location == NSNotFound) {
            break;
        }
        //*** Scan the decimal digits after the prefix, then the closing semicolon
        NSScanner *scanner = [NSScanner scannerWithString: dataString];
        [scanner setCharactersToBeSkipped: nil];
        [scanner setScanLocation: range.location + 2];
        int decimal = 0;
        BOOL wellFormed = [scanner scanInt: &decimal] && [scanner scanString: @";" intoString: NULL];
        if (wellFormed && decimal > 0 && decimal <= 0xFFFF) {
            //*** Use the decimal code to get the unicode character and replace this one reference
            NSString *unicode = [NSString stringWithFormat: @"%C", (unichar) decimal];
            NSRange codeRange = NSMakeRange(range.location, [scanner scanLocation] - range.location);
            [dataString replaceCharactersInRange: codeRange withString: unicode];
            searchStart = range.location + 1;
        } else {
            //*** Not a decimal code; leave it alone and keep searching after the prefix
            searchStart = range.location + 2;
        }
    } while (TRUE); //*** Loop until we hit the NSNotFound

    return dataString;
}
@end
mpemburn
0

If you have the Character Entity Reference as a string, e.g. @"2318", you can extract a recoded NSString with the correct Unicode character using strtoul:

NSString *unicodePoint = @"2318";
// Base 16: the string holds the hexadecimal code point.
unichar iconChar = (unichar) strtoul(unicodePoint.UTF8String, NULL, 16);
NSString *recoded = [NSString stringWithFormat:@"%C", iconChar];
NSLog(@"recoded: %@", recoded);
// prints out "recoded: ⌘"
Henrik Hartz
0

Swift 3 version of Jugale's answer

extension String {
    static private let mappings = [
        "&quot;" : "\"", "&amp;" : "&", "&lt;" : "<", "&gt;" : ">", "&nbsp;" : "\u{00A0}",
        "&iexcl;" : "¡", "&cent;" : "¢", "&pound;" : "£", "&curren;" : "¤", "&yen;" : "¥",
        "&brvbar;" : "¦", "&sect;" : "§", "&uml;" : "¨", "&copy;" : "©", "&ordf;" : "ª",
        "&laquo;" : "«", "&not;" : "¬", "&reg;" : "®", "&macr;" : "¯", "&deg;" : "°",
        "&plusmn;" : "±", "&sup2;" : "²", "&sup3;" : "³", "&acute;" : "´", "&micro;" : "µ",
        "&para;" : "¶", "&middot;" : "·", "&cedil;" : "¸", "&sup1;" : "¹", "&ordm;" : "º",
        "&raquo;" : "»", "&frac14;" : "¼", "&frac12;" : "½", "&frac34;" : "¾", "&iquest;" : "¿",
        "&times;" : "×", "&divide;" : "÷", "&ETH;" : "Ð", "&eth;" : "ð", "&THORN;" : "Þ",
        "&thorn;" : "þ", "&AElig;" : "Æ", "&aelig;" : "æ", "&OElig;" : "Œ", "&oelig;" : "œ",
        "&Aring;" : "Å", "&Oslash;" : "Ø", "&Ccedil;" : "Ç", "&ccedil;" : "ç", "&szlig;" : "ß",
        "&Ntilde;" : "Ñ", "&ntilde;" : "ñ"
    ]

    func stringByDecodingXMLEntities() -> String {

        guard let _ = self.range(of: "&", options: [.literal]) else {
            return self
        }

        var result = ""

        let scanner = Scanner(string: self)
        scanner.charactersToBeSkipped = nil

        let boundaryCharacterSet = CharacterSet(charactersIn: " \t\n\r;")

        repeat {
            var nonEntityString: NSString? = nil

            if scanner.scanUpTo("&", into: &nonEntityString) {
                if let s = nonEntityString as? String {
                    result.append(s)
                }
            }

            if scanner.isAtEnd {
                break
            }

            var didBreak = false
            for (k,v) in String.mappings {
                if scanner.scanString(k, into: nil) {
                    result.append(v)
                    didBreak = true
                    break
                }
            }

            if !didBreak {

                if scanner.scanString("&#", into: nil) {

                    var gotNumber = false
                    var charCodeUInt: UInt32 = 0
                    var charCodeInt: Int32 = -1
                    var xForHex: NSString? = nil

                    if scanner.scanString("x", into: &xForHex) {
                        gotNumber = scanner.scanHexInt32(&charCodeUInt)
                    }
                    else {
                        gotNumber = scanner.scanInt32(&charCodeInt)
                    }

                    if gotNumber {
                        let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt)
                        result.append(newChar)
                        scanner.scanString(";", into: nil)
                    }
                    else {
                        var unknownEntity: NSString? = nil
                        scanner.scanUpToCharacters(from: boundaryCharacterSet, into: &unknownEntity)
                        let h = xForHex ?? ""
                        let u = unknownEntity ?? ""
                        result.append("&#\(h)\(u)")
                    }
                }
                else {
                    scanner.scanString("&", into: nil)
                    result.append("&")
                }
            }

        } while (!scanner.isAtEnd)

        return result
    }
}
Xzya