76

Wondering if there is an easy way to do a simple HTML escape/unescape in Objective-C. What I want is something like this pseudocode:

NSString *string = @"&lt;span&gt;Foo&lt;/span&gt;";
[string stringByUnescapingHTML];

Which returns

<span>Foo</span>

Hopefully unescaping all other HTML entities as well, and even numeric character codes like &#1234; and the like.

Are there any methods in Cocoa Touch/UIKit to do this?

Peter Hosey
Alex Wayne
  • Probably the simplest way now with iOS7 is to use NSAttributedString's ability to decode HTML and then convert the NSAttributedString to an NSString - see my answer below. – orj Feb 20 '14 at 07:37

14 Answers

91

Check out my NSString category for XMLEntities. There are methods to decode XML entities (including all HTML character references), encode XML entities, strip tags, and remove newlines and whitespace from a string:

- (NSString *)stringByStrippingTags;
- (NSString *)stringByDecodingXMLEntities; // Including all HTML character references
- (NSString *)stringByEncodingXMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;
Hemang
Michael Waterfall
  • 2
    Seems it doesn't support Cyrillic. Have you seen one that supports? – slatvick Nov 10 '10 at 16:54
  • Thanks, I was already using your parses by the way. Great work! – Abramodj Jan 25 '12 at 00:53
  • Works like a charm. Thanks for this great category! – DevZarak Mar 06 '12 at 22:22
  • Thanks Michael. Do you have a version for ARC? I know I can set -fno-objc-arc in the build phase, but I'd prefer not to. – Dejell Jan 01 '13 at 13:01
  • 9
    What is up with the funky license? Cannot be used for diaries and journals? – alltom Mar 14 '13 at 01:56
  • @alltom "including, but not limited to" ... it's more restrictive than just diaries and journals, but ambiguously so. – Desty Aug 23 '13 at 17:40
  • 1
    This category is using the Google Toolbox category under the hood. It's better to just install the Google Toolbox helper directly via Cocoapods: `pod "GTMNSStringHTMLAdditions"`. See Travis's answer from September 2015. – skensell Jun 16 '16 at 09:44
37

Another HTML NSString category from Google Toolbox for Mac
Despite the name, this works on iOS too.

http://google-toolbox-for-mac.googlecode.com/svn/trunk/Foundation/GTMNSString+HTML.h

/// Get a string where internal characters that are escaped for HTML are unescaped 
//
///  For example, '&amp;' becomes '&'
///  Handles &#32; and &#x32; cases as well
///
//  Returns:
//    Autoreleased NSString
//
- (NSString *)gtm_stringByUnescapingFromHTML;

And I had to include only three files in the project: header, implementation and GTMDefines.h.

Nikita Rybak
  • 2
    Worth noting that if you're looking for the opposite of this, that is, `'&'` becomes `'&amp;'`, that's also covered in `- (NSString *)gtm_stringByEscapingForHTML;`, defined later in the file. – Kristian Nov 09 '11 at 17:39
  • Please, can u provide a link for `GTMDefines.h` – Almas Adilbek Jan 29 '13 at 12:08
  • Worth noting that this category isn't compatible with ARC, as it uses Objective-C objects in a struct, which isn't supported. Even setting the `-fno-objc-arc` compiler flag doesn't stop the struct being flagged as an error in Xcode. – robotpukeko Jul 04 '13 at 03:51
  • @robotpukeko That's strange because I was able to compile ARC project with this category just setting flag to .m file. – Timur Kuchkarov Jul 08 '13 at 11:51
  • just add -fno-objc-arc to the compile sources. and it works fine. – yong ho Aug 20 '13 at 11:03
  • I definitely got it to build by using `-fno-objc-arc` for `GTMNSString+HTML.m`. Add this as compiler flag in the target settings under "Build Phases". For the record you need `GTMNSString+HTML.m`, `GTMNSString+HTML.h`, and `GTMDefines.h`. – David Gish Jan 08 '14 at 17:34
  • You don't have to worry about compiler flags if you install this Category through Cocoapods. See Travis's answer, it's simply: `pod "GTMNSStringHTMLAdditions"` – skensell Jun 16 '16 at 09:42
31

This link contains the solution below. Core Foundation on the Mac has the CFXMLCreateStringByUnescapingEntities function, but it is not available on the iPhone.

@interface MREntitiesConverter : NSObject <NSXMLParserDelegate>{
    NSMutableString* resultString;
}

@property (nonatomic, retain) NSMutableString* resultString;

- (NSString*)convertEntitiesInString:(NSString*)s;

@end


@implementation MREntitiesConverter

@synthesize resultString;

- (id)init
{
    // Assign the result of [super init] to self before using it.
    self = [super init];
    if (self) {
        resultString = [[NSMutableString alloc] init];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)s {
        [self.resultString appendString:s];
}

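- (NSString*)convertEntitiesInString:(NSString*)s {
    if (!s) {
        NSLog(@"ERROR : Parameter string is nil");
        return nil;
    }
    // Start from an empty buffer so repeated calls do not accumulate text.
    [self.resultString setString:@""];
    // Wrap the input in a root element so it parses as an XML document.
    NSString* xmlStr = [NSString stringWithFormat:@"<d>%@</d>", s];
    NSData *data = [xmlStr dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:YES];
    NSXMLParser* xmlParse = [[[NSXMLParser alloc] initWithData:data] autorelease];
    [xmlParse setDelegate:self];
    [xmlParse parse];
    // Return an immutable, autoreleased copy of the collected characters.
    return [NSString stringWithString:resultString];
}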

- (void)dealloc {
    [resultString release];
    [super dealloc];
}

@end
Thomas
Andrew Grant
  • 1
    Wouldn't it be easier to implement this as an NSString category rather than an entirely separate object? Also, the return string is not autoreleased but the caller shouldn't own it because it was not explicitly allocated by the caller. – dreamlax Mar 18 '09 at 23:42
  • 6
    xmlParse also leaks btw, just add an autorelease to it and returnStr – Jarin Udom Apr 10 '09 at 21:53
  • 1
    If you make it an NSString category, you still need a delegate for the parser. So you will need a separate object anyway. – William Jockusch May 13 '10 at 15:30
  • 4
    Even though `CFXMLCreateStringByUnescapingEntities` is not available on iOS, you can copy its definition from CFXMLParser.c (from the Core Foundation source code) and use it in your project. I've tested it and it works. – Chaitanya Gupta Feb 03 '12 at 19:36
  • 2
    I found that this code removes all html tags (for example it left just "Facebook" from "Facebook") and sometimes just return nothing when complex html passed in. So, unfortunately it doesn't work for my goals. – Mike Keskinov Apr 15 '14 at 16:17
29

This is an incredibly hacked-together solution I did, but if you simply want to unescape a string without worrying about parsing, do this:

-(NSString *)htmlEntityDecode:(NSString *)string
    {
        string = [string stringByReplacingOccurrencesOfString:@"&quot;" withString:@"\""];
        string = [string stringByReplacingOccurrencesOfString:@"&apos;" withString:@"'"];
        string = [string stringByReplacingOccurrencesOfString:@"&lt;" withString:@"<"];
        string = [string stringByReplacingOccurrencesOfString:@"&gt;" withString:@">"];
        string = [string stringByReplacingOccurrencesOfString:@"&amp;" withString:@"&"]; // Do this last so that, e.g. @"&amp;lt;" goes to @"&lt;" not @"<"

        return string;
    }

I know it's by no means elegant, but it gets the job done. You can then decode an element by calling:

string = [self htmlEntityDecode:string];

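Like I said, it's hacky but it works. If you want to encode a string, swap the stringByReplacingOccurrencesOfString:withString: parameters and replace & first instead of last, so that the ampersands introduced by the other entities are not escaped a second time.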
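For illustration, here is a minimal sketch of the encoding direction. HTMLEntityEncode is a hypothetical name, not something from this answer or from Foundation, and it assumes the same five characters are the only ones that need escaping. Note that & is handled first here, the reverse of the decoding order.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the answer or of Foundation): the encoding
// counterpart of htmlEntityDecode:, covering the same five characters only.
static NSString *HTMLEntityEncode(NSString *string)
{
    // Escape "&" first; done last, it would re-escape the "&" in "&lt;" and friends.
    string = [string stringByReplacingOccurrencesOfString:@"&" withString:@"&amp;"];
    string = [string stringByReplacingOccurrencesOfString:@"\"" withString:@"&quot;"];
    // &apos; mirrors the decoder above; it is defined in XML and HTML5 but not in HTML 4.
    string = [string stringByReplacingOccurrencesOfString:@"'" withString:@"&apos;"];
    string = [string stringByReplacingOccurrencesOfString:@"<" withString:@"&lt;"];
    string = [string stringByReplacingOccurrencesOfString:@">" withString:@"&gt;"];
    return string;
}
```

Calling HTMLEntityEncode(@"<b>Tom & Jerry</b>") should give a string that the decoder above turns back into the original.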

Rik Renich
Andrew Kozlik
  • 5
    And how about performance? You are going through the string 5 times. It doesn't seem very efficient ;) – HyLian Sep 17 '10 at 22:01
  • It's definitely not the most efficient solution, but it works. What would be a more efficient way to do this? – Andrew Kozlik Sep 28 '10 at 13:46
  • 6
    Depending on how often this is used and how much time you can actually save by making this more efficient, it may not make sense to micro-optimize here. Since we're dealing with HTML here, it's likely that there's a network request somewhere, and it's going to take thousands of times longer to return than for the code shown above to execute. I'd probably lean towards not optimizing this code. – Josh Brown Jan 27 '11 at 04:26
  • The proposed method has bad performance but works OK if you only rarely need to process short strings. Thanks for saving me the time of implementing these 10 lines on my own ;) – Kostiantyn Sokolinskyi Apr 11 '11 at 08:41
  • @Andrew the more efficient way would be implementing your own string scanner that converts all these XML character entity references into the corresponding characters in a single scan of the string. That cuts the number of passes over the string from 5 to 1. Or you can employ a library like the one proposed below by Nikita - http://stackoverflow.com/questions/659602/objective-c-html-escape-unescape/5163893#5163893 – Kostiantyn Sokolinskyi Apr 11 '11 at 08:48
  • @Kostiantyn thanks for that. I'll have to take a look at implementing that sometime in the future. I know this is hacky, but it got the job done quick. – Andrew Kozlik Apr 12 '11 at 15:02
  • I think this is great! Easy, simple and effective for what I am doing. Thank you. – johnnelm9r Sep 02 '13 at 06:26
  • This solution is really poor. There are more than 5 HTML entities, so unless you want to extend this every time you run into a new entity (which is terrible practice in itself), find another way. – bitwit Mar 12 '14 at 20:22
  • Above method does the job partially. After calling above method on our HTML Encoded String, it gives back a string with HTML Elements. So what's the way to strip out those HTML Elements? :-( – Randika Vishman Jun 16 '14 at 20:51
  • I seriously think this is the **best solution out there**, @Andrew. It's totally unbelievable that this is not built in to iOS in this day and age. Thanks again! – Fattie Sep 01 '14 at 14:12
  • Apart from being incomplete and expensive this solution is just plain wrong for some cases: `&amp;lt;` should be unmasked to `&lt;` (not `<`). – Nikolai Ruhe Oct 14 '14 at 09:43
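
The single-pass scanner suggested in the comments above could be sketched roughly as follows. DecodeEntitiesOnePass and StringFromNumericReference are hypothetical names, not part of Foundation or any library mentioned here. The sketch assumes only the five XML named entities plus decimal and hexadecimal numeric references need decoding; anything it does not recognize is copied through unchanged.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turns the body of a numeric reference ("#1234" or "#x4D2")
// into a one-character string, or returns nil if it is not a valid code point.
static NSString *StringFromNumericReference(NSString *body)
{
    BOOL hex = [body hasPrefix:@"#x"] || [body hasPrefix:@"#X"];
    NSUInteger start = hex ? 2 : 1;
    if ([body length] <= start) return nil;
    uint32_t code = 0;
    for (NSUInteger i = start; i < [body length]; i++) {
        unichar c = [body characterAtIndex:i];
        uint32_t digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (hex && c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (hex && c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return nil;
        code = code * (hex ? 16 : 10) + digit;
        if (code > 0x10FFFF) return nil; // beyond the Unicode range
    }
    if (code == 0 || (code >= 0xD800 && code <= 0xDFFF)) return nil;
    unichar units[2];
    NSUInteger count = 1;
    if (code < 0x10000) {
        units[0] = (unichar)code;
    } else {
        // Characters outside the BMP need a UTF-16 surrogate pair.
        code -= 0x10000;
        units[0] = (unichar)(0xD800 + (code >> 10));
        units[1] = (unichar)(0xDC00 + (code & 0x3FF));
        count = 2;
    }
    return [NSString stringWithCharacters:units length:count];
}

// Hypothetical helper: decodes entities in one left-to-right pass.
static NSString *DecodeEntitiesOnePass(NSString *input)
{
    NSDictionary *named = @{ @"amp": @"&", @"lt": @"<", @"gt": @">",
                             @"quot": @"\"", @"apos": @"'" };
    NSUInteger length = [input length];
    NSMutableString *result = [NSMutableString stringWithCapacity:length];
    NSUInteger pos = 0;
    while (pos < length) {
        NSRange amp = [input rangeOfString:@"&" options:0
                                     range:NSMakeRange(pos, length - pos)];
        if (amp.location == NSNotFound) {
            [result appendString:[input substringFromIndex:pos]];
            break;
        }
        // Copy the literal text before the ampersand unchanged.
        [result appendString:[input substringWithRange:NSMakeRange(pos, amp.location - pos)]];
        // Look for the closing ";" within a short window after the "&".
        NSUInteger window = MIN((NSUInteger)12, length - amp.location);
        NSRange semi = [input rangeOfString:@";" options:0
                                      range:NSMakeRange(amp.location, window)];
        NSString *replacement = nil;
        if (semi.location != NSNotFound) {
            NSRange bodyRange = NSMakeRange(amp.location + 1, semi.location - amp.location - 1);
            NSString *body = [input substringWithRange:bodyRange];
            replacement = named[body];
            if (!replacement && [body hasPrefix:@"#"]) {
                replacement = StringFromNumericReference(body);
            }
        }
        if (replacement) {
            [result appendString:replacement];
            pos = semi.location + 1;   // continue after the ";"
        } else {
            [result appendString:@"&"]; // not an entity we know; keep the "&"
            pos = amp.location + 1;
        }
    }
    return result;
}
```

Because each decoded character is written straight to the output and never rescanned, `&amp;lt;` comes out as `&lt;` no matter how the lookup table is ordered, which is the case the last comment raises.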
11

In iOS 7 you can use NSAttributedString's ability to import HTML to convert HTML entities to an NSString.

Eg:

@interface NSAttributedString (HTML)
+ (instancetype)attributedStringWithHTMLString:(NSString *)htmlString;
@end

@implementation NSAttributedString (HTML)
+ (instancetype)attributedStringWithHTMLString:(NSString *)htmlString
{
    NSDictionary *options = @{ NSDocumentTypeDocumentAttribute : NSHTMLTextDocumentType,
                               NSCharacterEncodingDocumentAttribute :@(NSUTF8StringEncoding) };

    NSData *data = [htmlString dataUsingEncoding:NSUTF8StringEncoding];

    return [[NSAttributedString alloc] initWithData:data options:options documentAttributes:nil error:nil];
}

@end

Then in your code when you want to clean up the entities:

NSString *cleanString = [[NSAttributedString attributedStringWithHTMLString:question.title] string];

This is probably the simplest way, but I don't know how performant it is. You should probably be pretty damn sure the content you're "cleaning" doesn't contain any <img> tags or stuff like that, because this method will download those images during the HTML to NSAttributedString conversion. :)

orj
  • I did this by writing a method that takes the string, cleans it, and returns the cleaned string back. See it [here](https://gist.github.com/asimpson/4de93d0f64bd8953b506). – Adam Simpson Mar 19 '14 at 17:19
  • This solution also removes all existing HTML tags, for example it left `this is test` from `this is test`. – Mike Keskinov Apr 15 '14 at 16:28
  • 2
    Just a heads up, the NSAttributedString does terrible things in the constructor, like spinning the runloop. I was un-able to use this on the main thread without making UIKit very un-happy. – Brian King Dec 10 '14 at 20:43
  • This is rad. Thank you so much, worked like a charm for me. – Tim Johnsen Jul 11 '17 at 16:29
5

Here's a solution that neutralizes all characters (by making them all HTML encoded entities for their unicode value)... Used this for my need (making sure a string that came from the user but was placed inside of a webview couldn't have any XSS attacks):

Interface:

@interface NSString (escape)
- (NSString*)stringByEncodingHTMLEntities;
@end

Implementation:

@implementation NSString (escape)

- (NSString*)stringByEncodingHTMLEntities {
    // Rather than mapping each individual entity and checking if it needs to be replaced, we simply replace every character with the hex entity

    NSMutableString *resultString = [NSMutableString string];
    for(int pos = 0; pos<[self length]; pos++)
        [resultString appendFormat:@"&#x%x;",[self characterAtIndex:pos]];
    return [NSString stringWithString:resultString];
}

@end

Usage Example:

UIWebView *webView = [[UIWebView alloc] init];
NSString *userInput = @"<script>alert('This is an XSS ATTACK!');</script>";
NSString *safeInput = [userInput stringByEncodingHTMLEntities];
[webView loadHTMLString:safeInput baseURL:nil];

Your mileage will vary.

BadPirate
  • You're missing a ';' at the end of the escape sequence, also, in all the docs I found the length of a unicode number is 4 with leading zeros, so your format should be `@"%04x;"`, other than that, I'd add a simple alpha numeric detector and just copy such characters without escaping. – Moshe Gottlieb Feb 03 '13 at 11:42
  • Interestingly enough, this code is working fine for me without the semi-colon. Probably just webkit being robust. I added that. However don't do the %04x as suggested, or you could have trouble with single-byte multi-byte unicode characters. Using %x prints the correct number for both single and multi-byte (like japanese). – BadPirate Feb 04 '13 at 18:22
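
One caveat worth illustrating: characterAtIndex: returns UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for instance) reaches the loop above as two surrogate halves, and each half is written as its own entity. The variant below is only a sketch under that observation. EncodeAsHexEntities is a hypothetical name; it combines surrogate pairs into one code point and, following the suggestion in the comments, copies ASCII letters and digits through unescaped.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: hex-encodes every character except ASCII letters and digits,
// treating a UTF-16 surrogate pair as a single code point.
static NSString *EncodeAsHexEntities(NSString *input)
{
    NSMutableString *result = [NSMutableString string];
    NSUInteger length = [input length];
    for (NSUInteger i = 0; i < length; i++) {
        uint32_t code = [input characterAtIndex:i];
        // A high surrogate followed by a low surrogate forms one code point.
        if (code >= 0xD800 && code <= 0xDBFF && i + 1 < length) {
            uint32_t low = [input characterAtIndex:i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00);
                i++; // the low surrogate has been consumed
            }
        }
        // An unpaired surrogate is not a valid character; use U+FFFD instead.
        if (code >= 0xD800 && code <= 0xDFFF) {
            code = 0xFFFD;
        }
        BOOL alphanumeric = (code >= '0' && code <= '9') ||
                            (code >= 'a' && code <= 'z') ||
                            (code >= 'A' && code <= 'Z');
        if (alphanumeric) {
            [result appendFormat:@"%c", (char)code];
        } else {
            [result appendFormat:@"&#x%x;", code];
        }
    }
    return result;
}
```

With this, the string `a<` followed by U+1F600 becomes `a&#x3c;&#x1f600;` rather than ending in two separate surrogate entities.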
4

The least invasive and most lightweight way to encode and decode HTML or XML strings is to use the GTMNSStringHTMLAdditions CocoaPod.

It is simply the Google Toolbox for Mac NSString category GTMNSString+HTML, stripped of the dependency on GTMDefines.h. So all you need to add is one .h and one .m, and you're good to go.

Example:

#import "GTMNSString+HTML.h"

// Encoding a string with XML / HTML elements
NSString *stringToEncode = @"<TheBeat>Goes On</TheBeat>";
NSString *encodedString = [stringToEncode gtm_stringByEscapingForHTML];

// encodedString looks like this now:
// &lt;TheBeat&gt;Goes On&lt;/TheBeat&gt;

// Decoding a string with XML / HTML encoded elements
NSString *stringToDecode = @"&lt;TheBeat&gt;Goes On&lt;/TheBeat&gt;";
NSString *decodedString = [stringToDecode gtm_stringByUnescapingFromHTML];

// decodedString looks like this now:
// <TheBeat>Goes On</TheBeat>
T Blank
2

This is an easy to use NSString category implementation:

It is far from complete but you can add some missing entities from here: http://code.google.com/p/statz/source/browse/trunk/NSString%2BHTML.m

Usage:

#import "NSString+HTML.h"

NSString *raw = [NSString stringWithFormat:@"<div></div>"];
NSString *escaped = [raw htmlEscapedString];
Blago
  • I can confirm that this category works perfectly. It is perfectly written. I urge everyone to use it - I doubt there's a better solution out there! Again it's totally amazing this is not yet built in to iOS .. bizarro. Thanks @blago – Fattie Sep 01 '14 at 14:19
1

The MREntitiesConverter above is an HTML stripper, not encoder.

If you need an encoder, go here: Encode NSString for XML/HTML

Brain2000
0

MREntitiesConverter doesn't work for unescaping malformed XML. It will fail on a simple URL, because the bare & characters are not valid XML:

http://www.google.com/search?client=safari&rls=en&q=fail&ie=UTF-8&oe=UTF-8

richcollins
  • 1,504
  • 4
  • 18
  • 28
0

If you need to generate a literal you might consider using a tool like this:

http://www.freeformatter.com/java-dotnet-escape.html#ad-output

to accomplish the work for you.

See also this answer.

diadyne
0

The easiest solution is to create a category as below:

Here’s the category’s header file:

#import <Foundation/Foundation.h>
@interface NSString (URLEncoding)
-(NSString *)urlEncodeUsingEncoding:(NSStringEncoding)encoding;
@end

And here’s the implementation:

#import "NSString+URLEncoding.h"
@implementation NSString (URLEncoding)
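-(NSString *)urlEncodeUsingEncoding:(NSStringEncoding)encoding {
    // The Create function returns an owned (+1) string; autorelease it so the
    // caller does not leak it (this code uses manual reference counting).
    return [(NSString *)CFURLCreateStringByAddingPercentEscapes(NULL,
               (CFStringRef)self,
               NULL,
               (CFStringRef)@"!*'\"();:@&=+$,/?%#[]% ",
               CFStringConvertNSStringEncodingToEncoding(encoding)) autorelease];
}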
@end

And now we can simply do this:

NSString *raw = @"hell & brimstone + earthly/delight";
NSString *url = [NSString stringWithFormat:@"http://example.com/example?param=%@",
            [raw urlEncodeUsingEncoding:NSUTF8StringEncoding]];
// Never pass a non-literal string as the format argument.
NSLog(@"%@", url);

Credit for this answer goes to the website below:

http://madebymany.com/blog/url-encoding-an-nsstring-on-ios
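
For newer SDKs, note that CFURLCreateStringByAddingPercentEscapes has been deprecated since iOS 9 and OS X 10.11. A rough equivalent built on NSString's stringByAddingPercentEncodingWithAllowedCharacters: (iOS 7 and later) might look like the sketch below. PercentEncodeQueryValue is a hypothetical name, and restricting the allowed set to the RFC 3986 unreserved characters is an assumption chosen to be conservative for a single query value, not a rule from this answer.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: percent-encodes everything except the RFC 3986
// "unreserved" characters (letters, digits, "-", ".", "_", "~").
static NSString *PercentEncodeQueryValue(NSString *value)
{
    NSCharacterSet *unreserved = [NSCharacterSet characterSetWithCharactersInString:
        @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
    // Characters outside the allowed set are converted to UTF-8 and percent-encoded.
    return [value stringByAddingPercentEncodingWithAllowedCharacters:unreserved];
}
```

It can be dropped into the usage example above in place of urlEncodeUsingEncoding:, with the difference that it always encodes as UTF-8.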
Hashim Akhtar
-4

Why not just use this?

NSData *data = [s dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:YES];
NSString *result = [[[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] autorelease];
return result;

Noob question but in my case it works...

kheraud
  • 1
    Why would this work? As far as I can tell it simply converts to binary data and then back to a string. I don't understand what here would turn "&gt;" into ">" and vice versa. – Alex Wayne Feb 18 '11 at 17:43
-5

This is an old answer that I posted some years ago. My intention was not to provide a "good" and "respectable" solution, but a "hacky" one that might be useful under some circumstances. Please don't use this solution unless nothing else works.

Actually, it works perfectly fine in many situations that other answers don't because the UIWebView is doing all the work. And you can even inject some javascript (which can be dangerous and/or useful). The performance should be horrible, but actually is not that bad.

There is another solution that has to be mentioned. Just create a UIWebView, load the encoded string and get the text back. It escapes tags "<>", decodes all HTML entities (e.g. "&gt;"), and might work where others don't (e.g. with Cyrillic text). I don't think it's the best solution, but it can be useful if the above solutions don't work.

Here is a small example using ARC:

@interface YourClass() <UIWebViewDelegate>

    @property UIWebView *webView;

@end

@implementation YourClass 

- (void)someMethodWhereYouGetTheHtmlString:(NSString *)htmlString {
    self.webView = [[UIWebView alloc] init];
    // Set the delegate before loading so no callbacks are missed.
    self.webView.delegate = self;
    // Wrap the incoming string in a minimal page (a new local name avoids redeclaring the parameter).
    NSString *page = [NSString stringWithFormat:@"<html><body>%@</body></html>", htmlString];
    [self.webView loadHTMLString:page baseURL:nil];
}

- (void)webView:(UIWebView *)webView didFailLoadWithError:(NSError *)error {
    self.webView = nil;
}

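- (void)webViewDidFinishLoad:(UIWebView *)webView {
    // Read the decoded text first; releasing the web view beforehand would message nil.
    NSString *escapedString = [webView stringByEvaluatingJavaScriptFromString:@"document.body.textContent;"];
    // Use escapedString here, then release the web view.
    self.webView = nil;
}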

- (void)webViewDidStartLoad:(UIWebView *)webView {
    // Do Nothing
}

@end
FranMowinckel