108

There are a couple of different ways to remove HTML tags from an NSString in Cocoa.

One way is to render the string into an NSAttributedString and then grab the rendered text.

Another way is to use NSXMLDocument's -objectByApplyingXSLTString:arguments:error: method to apply an XSLT transform that does it.

Unfortunately, the iPhone doesn't support NSAttributedString or NSXMLDocument. There are too many edge cases and malformed HTML documents for me to feel comfortable using regex or NSScanner. Does anyone have a solution to this?

One suggestion has been to simply look for opening and closing tag characters, but this method won't work except in very trivial cases.

For example, these cases (from the Perl Cookbook chapter on the same subject) would break this method:

<IMG SRC = "foo.gif" ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
TheNeil
lfalin
  • You could add a bit of logic to take quotes and apostrophes into account... CDATA would take a bit more work, but the whole point of HTML is that unknown tags can be ignored by the parser; if you treat ALL tags as unknown, then you should just get raw text. – Ben Gottlieb Nov 10 '08 at 17:44
  • I'd like to comment that a good (but basic) regular expression will definitely not break at your examples. Certainly not if you can guarantee well formed XHTML. I know that you said you can't, but I wonder why ;-) – Jake Oct 09 '09 at 12:54
  • 1
    There is **Good answer** for this question. [Flatten HTML using Objective c](http://rudis.net/content/2009/01/21/flatten-html-content-ie-strip-tags-cocoaobjective-c) – vipintj Jul 09 '10 at 09:12
  • Unfortunately, using NSScanner is damn slow. – steipete Mar 27 '11 at 15:56
  • Even more unfortunately, the linked NSScanner example only works for trivial html. It fails for every test case I mentioned in my post. – lfalin Jan 02 '13 at 14:35
  • Exactly why doesn't iOS support NSAttributedString for you? https://developer.apple.com/library/ios/documentation/Cocoa/Reference/Foundation/Classes/NSAttributedString_Class/index.html – jasonjwwilliams Feb 13 '15 at 02:18
  • @jasonjwwilliams I wrote this question in 2008. Support for NSAttributedString wasn't added to iOS until 3.2 (aka, the iPad release), which came out in April 2010. – lfalin Feb 13 '15 at 16:59
  • @ifalin Apologies, I lost track of the 2008 date of the original post while reading. – jasonjwwilliams Feb 17 '15 at 19:11
  • @jasonjwwilliams No worries. This is a problem with SO. You have answers to questions which often only apply as a "best practice" within a certain timeframe or API version. – lfalin Feb 17 '15 at 19:18

22 Answers

313

A quick and "dirty" (removes everything between < and >) solution, works with iOS >= 3.2:

-(NSString *) stringByStrippingHTML {
  NSRange r;
  NSString *s = [[self copy] autorelease];
  while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
    s = [s stringByReplacingCharactersInRange:r withString:@""];
  return s;
}

I have this declared as a category on NSString.

dlinsin
m.kocikowski
  • I'm a complete newb at iPhone development, but can I ask how you use this? – James Apr 26 '12 at 14:29
  • 4
    @James To use the method posted in the solution. You have to create a category for NSString. Look up "Objective-C Category" in Google. Then you add that method in the m file, and the prototype in the h file. When that is all set up, to use it all you have to do is have a string object (Example: NSString *myString = ...) and you call that method on your string object (NSString *strippedString = [myString stringByStrippingHTML];). – Roberto May 02 '12 at 17:40
  • 3
    +1 Great use for regular expressions, but does not cover lots of cases unfortunately. – matm Jun 18 '12 at 14:53
  • This code would break on Perl Cookbook examples 1, 3, and 4 in the question. – Aaron Brager Apr 18 '13 at 16:04
  • I am getting a NSString may not respond to stringByStrippingHTML with this. – Rick Jul 09 '13 at 19:01
  • 3
    Quick and dirty indeed.... This function causes a huge memory leak in my application... Well, in its defence, I am using large amounts of data.... – EZFrag Sep 09 '13 at 08:37
  • 5
    In my App this solution caused performance problems. I switched to a solution with NSScanner instead NSRegularExpressionSearch. Now the performance problems are gone – Carmen Sep 13 '13 at 13:13
  • 2
    It is very very very memory and time consuming. Only use this with small amounts of html! – ullstrm Apr 02 '14 at 13:11
  • Great idea, but insanely inefficient code. Instead use -[NSRegularExpression enumerateMatchesInString:] so that you only parse the regex once and don't rescan text you've already scanned. – Adlai Holler Jun 19 '15 at 16:37
  • When I try this I am getting this error : ` Terminating app due to uncaught exception 'NSRangeException', reason: '-[__NSCFString substringWithRange:]: Range {8587, 53} out of bounds; string length 8300'` – Rahul Jan 19 '16 at 13:34
30

This NSString category uses the NSXMLParser to accurately remove any HTML tags from an NSString. This is a single .m and .h file that can be included into your project easily.

https://gist.github.com/leighmcculloch/1202238

You then strip html by doing the following:

Import the header:

#import "NSString_stripHtml.h"

And then call stripHtml:

NSString* mystring = @"<b>Hello</b> World!!";
NSString* stripped = [mystring stripHtml];
// stripped will be = Hello World!!

This also works with malformed HTML that technically isn't XML.

Community
Leigh McCulloch
  • 3
    Whilst the regular expression (as said by m.kocikowski) is quick and dirty, this is more robust. Example string: @"My test name\">html string". This answer returns: My test html string. Regular expression returns: My test name">html string. Whilst this isn't that common, it's just more robust. – DonnaLea Sep 08 '11 at 06:29
  • 1
    Except if you have a string like "S&P 500", it will strip everything after the ampersand and just return the string "S". – Joshua Gross Sep 08 '13 at 22:12
11
UITextView *textview = [[UITextView alloc] initWithFrame:CGRectMake(10, 130, 250, 170)];
NSString *str = @"This is <font color='red'>simple</font>";
[textview setValue:str forKey:@"contentToHTMLString"];
textview.textAlignment = NSTextAlignmentLeft;
textview.editable = NO;
textview.font = [UIFont fontWithName:@"Verdana" size:20.0];
// addSubview: is an instance method; this assumes the code runs inside a view controller.
[self.view addSubview:textview];

Works fine for me.

9

You can use it like below:

- (void)myMethod
{
    NSString *htmlStr = @"<some>html</string>";
    NSString *strWithoutFormatting = [self stringByStrippingHTML:htmlStr];
}

- (NSString *)stringByStrippingHTML:(NSString *)str
{
    NSRange r;
    // Remove one tag per pass until no "<...>" match remains.
    while ((r = [str rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
    {
        str = [str stringByReplacingCharactersInRange:r withString:@""];
    }
    return str;
}
Kirtikumar A.
8

Use this:

NSString *myregex = @"<[^>]*>"; // regex to remove any HTML tag

NSString *htmlString = @"<html>bla bla</html>";
NSString *stringWithoutHTML = [htmlString stringByReplacingOccurrencesOfRegex:myregex withString:@""];

Don't forget to include this in your code: #import "RegexKitLite.h". Here is the link to download this API: http://regexkit.sourceforge.net/#Downloads

Johnny
Mohamed AHDIDOU
7

Take a look at NSXMLParser. It's a SAX-style parser. You should be able to use it to detect tags or other unwanted elements in the XML document and ignore them, capturing only pure text.
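
A minimal sketch of that idea, assuming ARC and input that is well-formed XML once wrapped in a root element. The class name TextCollector and the function TextByParsingXML are hypothetical names made up for this illustration; only the NSXMLParser and NSXMLParserDelegate calls are real API. The delegate ignores element callbacks entirely and just accumulates character data:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects the character data NSXMLParser reports.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

// Called for each run of text found between tags.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

// Returns the text content of the fragment, or nil if the parser reports an error.
static NSString *TextByParsingXML(NSString *fragment) {
    // Wrap in a single root so a fragment with several top-level nodes can parse.
    NSString *wrapped = [NSString stringWithFormat:@"<root>%@</root>", fragment];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TextCollector *collector = [[TextCollector alloc] init];
    parser.delegate = collector;
    return [parser parse] ? [collector.text copy] : nil;
}
```

Because NSXMLParser is a strict XML parser, this sketch returns nil for input it cannot parse (an unclosed tag or a bare ampersand, for instance), so it is probably only suitable where the markup is known to be XHTML-like.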

Colin Barrett
6

Here's a more efficient solution than the accepted answer:

- (NSString*)hp_stringByRemovingTags
{
    static NSRegularExpression *regex = nil;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        regex = [NSRegularExpression regularExpressionWithPattern:@"<[^>]+>" options:kNilOptions error:nil];
    });

    // Use reverse enumerator to delete characters without affecting indexes
    NSArray *matches =[regex matchesInString:self options:kNilOptions range:NSMakeRange(0, self.length)];
    NSEnumerator *enumerator = matches.reverseObjectEnumerator;

    NSTextCheckingResult *match = nil;
    NSMutableString *modifiedString = self.mutableCopy;
    while ((match = [enumerator nextObject]))
    {
        [modifiedString deleteCharactersInRange:match.range];
    }
    return modifiedString;
}

The above NSString category uses a regular expression to find all the matching tags, makes a copy of the original string and finally removes all the tags in place by iterating over them in reverse order. It's more efficient because:

  • The regular expression is initialised only once.
  • A single copy of the original string is used.

This performed well enough for me but a solution using NSScanner might be more efficient.

Like the accepted answer, this solution doesn't address all the border cases requested by @lfalin. Those would require much more expensive parsing, which the average use case most likely doesn't need.
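
For comparison, here is a rough sketch of what the NSScanner alternative mentioned above could look like. The function name StripTagsWithScanner is hypothetical, and no claim is made here about its speed relative to the regex version. It makes a single pass, copying text up to each "<" and skipping through the next ">":

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: single-pass tag removal with NSScanner.
static NSString *StripTagsWithScanner(NSString *html) {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    scanner.charactersToBeSkipped = nil; // keep whitespace in the output
    NSMutableString *result = [NSMutableString stringWithCapacity:html.length];

    while (![scanner isAtEnd]) {
        NSString *text = nil;
        // Copy everything up to the next '<'.
        if ([scanner scanUpToString:@"<" intoString:&text]) {
            [result appendString:text];
        }
        if ([scanner isAtEnd]) {
            break;
        }
        // Now at '<': skip ahead to the next '>' and consume it if present.
        [scanner scanUpToString:@">" intoString:NULL];
        [scanner scanString:@">" intoString:NULL];
    }
    return result;
}
```

Note one difference from the regular expression: if a "<" is never closed, this sketch drops the rest of the string, whereas the pattern "<[^>]+>" would leave it in place.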

hpique
5

Without a loop (at least on our side):

- (NSString *)removeHTML {

    static NSRegularExpression *regexp;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        regexp = [NSRegularExpression regularExpressionWithPattern:@"<[^>]+>" options:kNilOptions error:nil];
    });

    return [regexp stringByReplacingMatchesInString:self
                                            options:kNilOptions
                                              range:NSMakeRange(0, self.length)
                                       withTemplate:@""];
}
Rémy
5
NSAttributedString *str = [[NSAttributedString alloc]
          initWithData:[trimmedString dataUsingEncoding:NSUTF8StringEncoding]
               options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                         NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]}
    documentAttributes:nil
                 error:nil];
Robert
Pavan Sisode
  • When we have metadata containing HTML tags and want those tags applied, we should use the above code to achieve the desired output. – Pavan Sisode Jun 13 '15 at 14:40
4
#import "RegexKitLite.h"

NSString *text = [html stringByReplacingOccurrencesOfRegex:@"<[^>]+>" withString:@""];
sra
Jim Liu
  • 2
    HTML isn't a regular language so you shouldn't be trying to parse/strip it with a regular expression. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – csaunders Dec 07 '11 at 18:34
3

I've extended the answer by m.kocikowski and tried to make it a bit more efficient by using an NSMutableString. I've also structured it for use in a static Utils class (I know a Category is probably the best design though), and removed the autorelease so it compiles in an ARC project.

Included here in case anybody finds it useful.

.h

+ (NSString *)stringByStrippingHTML:(NSString *)inputString;

.m

+ (NSString *)stringByStrippingHTML:(NSString *)inputString 
{
  NSMutableString *outString;

  if (inputString)
  {
    outString = [[NSMutableString alloc] initWithString:inputString];

    if ([inputString length] > 0)
    {
      NSRange r;

      while ((r = [outString rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
      {
        [outString deleteCharactersInRange:r];
      }      
    }
  }

  return outString; 
}
Dan J
3

If you want to get the content of a web page (HTML document) without the HTML tags, use this code inside UIWebView's webViewDidFinishLoad: delegate method.

  NSString *myText = [webView stringByEvaluatingJavaScriptFromString:@"document.documentElement.textContent"];
Hemang
Biranchi
2

This is a modernization of m.kocikowski's answer that also removes the whitespace following each tag:

@implementation NSString (StripXMLTags)

- (NSString *)stripXMLTags
{
    NSRange r;
    NSString *s = [self copy];
    while ((r = [s rangeOfString:@"<[^>]+>\\s*" options:NSRegularExpressionSearch]).location != NSNotFound)
        s = [s stringByReplacingCharactersInRange:r withString:@""];
    return s;
}

@end
digipeople
2

I would imagine the safest way would just be to parse for <>s, no? Loop through the entire string, and copy anything not enclosed in <>s to a new string.
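
As a rough sketch of that loop (the function name StripTagsByLooping is hypothetical), the version below copies every character that is not between "<" and ">". It also tracks quoted attribute values inside a tag, which is an addition beyond the description above, so that a ">" inside quotes does not end the tag early:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: copies everything that is not enclosed in <...>.
static NSString *StripTagsByLooping(NSString *html) {
    NSMutableString *output = [NSMutableString stringWithCapacity:html.length];
    BOOL inTag = NO;
    unichar quote = 0; // nonzero while inside a quoted attribute value

    for (NSUInteger i = 0; i < html.length; i++) {
        unichar c = [html characterAtIndex:i];
        if (!inTag) {
            if (c == '<') {
                inTag = YES;
            } else {
                [output appendString:[NSString stringWithCharacters:&c length:1]];
            }
        } else if (quote != 0) {
            if (c == quote) {
                quote = 0; // closing quote of the attribute value
            }
        } else if (c == '"' || c == '\'') {
            quote = c;     // opening quote inside a tag
        } else if (c == '>') {
            inTag = NO;    // end of the tag
        }
    }
    return output;
}
```

This handles an attribute such as ALT = "A > B", but it makes no attempt to understand comments, script contents, or CDATA sections, so those cases from the question would still come out wrong.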

Ben Gottlieb
2

Here's the Swift version:

func stripHTMLFromString(string: String) -> String {
  var copy = string
  while let range = copy.rangeOfString("<[^>]+>", options: .RegularExpressionSearch) {
    copy = copy.stringByReplacingCharactersInRange(range, withString: "")
  }
  copy = copy.stringByReplacingOccurrencesOfString("&nbsp;", withString: " ")
  copy = copy.stringByReplacingOccurrencesOfString("&amp;", withString: "&")
  return copy
}
JohnVanDijk
1

The following is the accepted answer, but instead of a category it is a simple helper method with the string passed into it. (Thank you, m.kocikowski.)

-(NSString *) stringByStrippingHTML:(NSString*)originalString {
    NSRange r;
    NSString *s = [originalString copy];
    while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
        s = [s stringByReplacingCharactersInRange:r withString:@""];
    return s;
}
tmr
0

Extending m.kocikowski's and Dan J's answers with more explanation for newbies:

1# First, create an Objective-C category so the code is usable from any class.

.h

@interface NSString (NAME_OF_CATEGORY)

- (NSString *)stringByStrippingHTML;

@end

.m

@implementation NSString (NAME_OF_CATEGORY)

- (NSString *)stringByStrippingHTML
{
    NSMutableString *outString;
    NSString *inputString = self;

    if (inputString)
    {
        outString = [[NSMutableString alloc] initWithString:inputString];

        if ([inputString length] > 0)
        {
            NSRange r;

            while ((r = [outString rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
            {
                [outString deleteCharactersInRange:r];
            }
        }
    }

    return outString;
}

@end

2# Then import the .h file of the category you've just created, e.g.:

#import "NSString+NAME_OF_CATEGORY.h"

3# Call the method:

NSString* sub = [result stringByStrippingHTML];
NSLog(@"%@", sub);

Here, result is the NSString I want to strip the tags from.

Ashoor
0

I followed the accepted answer by m.kocikowski and modified it slightly to use an autorelease pool to clean up all of the temporary strings created by stringByReplacingCharactersInRange:withString:.

The header comment for this method states: /* Replace characters in range with the specified string, returning new string. */

So, depending on the length of your XML, you may be creating a huge pile of new autoreleased strings which are not cleaned up until the enclosing autorelease pool is drained. If you are unsure when that will happen, or if a user action could trigger many calls to this method before then, you can wrap the work in an @autoreleasepool. These blocks can be nested and used within loops where possible.

Apple's reference on @autoreleasepool states this... "If you write a loop that creates many temporary objects. You may use an autorelease pool block inside the loop to dispose of those objects before the next iteration. Using an autorelease pool block in the loop helps to reduce the maximum memory footprint of the application." I have not used it in the loop, but at least this method cleans up after itself now.

- (NSString *) stringByStrippingHTML {
    NSString *retVal;
    @autoreleasepool {
        NSRange r;
        NSString *s = [[self copy] autorelease];
        while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound) {
            s = [s stringByReplacingCharactersInRange:r withString:@""];
        }
        // Take ownership of the result so it survives the pool being drained.
        retVal = [s copy];
    }
    // pool is drained, release s and all temp
    // strings created by stringByReplacingCharactersInRange
    // Manual reference counting: hand the owned copy back autoreleased so it does not leak.
    return [retVal autorelease];
}
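
For the in-loop variant that the quoted Apple guidance describes, a minimal sketch might look like the following. It assumes ARC rather than manual reference counting (so there are no autorelease calls), and the function name StripTagsWithInnerPool is hypothetical. Each pass gets its own pool, so the temporary strings from one iteration can be released before the next begins:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, ARC assumed: drains a pool on every pass of the loop.
static NSString *StripTagsWithInnerPool(NSString *input) {
    NSString *s = [input copy];
    while (YES) {
        @autoreleasepool {
            NSRange r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch];
            if (r.location == NSNotFound) {
                break; // leaving the block still pops this iteration's pool
            }
            // The strong variable s keeps the new string alive past the pool.
            s = [s stringByReplacingCharactersInRange:r withString:@""];
        }
    }
    return s;
}
```

This only lowers peak memory; it still rescans the string from the start for every tag, so it does not address the speed concerns raised in the comments on the accepted answer.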
jcpennypincher
0

Another way:

Interface:

-(NSString *) stringByStrippingHTML:(NSString*)inputString;

Implementation:

- (NSString *) stringByStrippingHTML:(NSString*)inputString
{
    NSAttributedString *attrString = [[NSAttributedString alloc] initWithData:[inputString dataUsingEncoding:NSUTF8StringEncoding] options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)} documentAttributes:nil error:nil];
    NSString *str = [attrString string];

    // You can add replacements here as needed.
    // NSString is immutable, so each result must be assigned back to str.
    str = [str stringByReplacingOccurrencesOfString:@"[" withString:@""];
    str = [str stringByReplacingOccurrencesOfString:@"]" withString:@""];
    str = [str stringByReplacingOccurrencesOfString:@"\n" withString:@""];

    return str;
}

Usage:

cell.exampleClass.text = [self stringByStrippingHTML:[exampleJSONParsingArray valueForKey: @"key"]];

or simply:

NSString *myClearStr = [self stringByStrippingHTML:rudeStr];

Nike Kov
0

If you are willing to use the Three20 framework, it has a category on NSString that adds a stringByRemovingHTMLTags method. See NSStringAdditions.h in the Three20Core subproject.

jarnoan
0

An updated answer for @m.kocikowski that works on recent iOS versions.

- (NSString *)stringByStrippingHTMLFromString:(NSString *)str {
    NSRange range;
    while ((range = [str rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
        str = [str stringByReplacingCharactersInRange:range withString:@""];
    return str;
}

Ahmed Awad
-3

Here's a blog post that discusses a couple of libraries available for stripping HTML: http://sugarmaplesoftware.com/25/strip-html-tags/ Note the comments, where other solutions are offered.

micco
  • This is the exact set of comments that I linked to in my question as an example of what would not work. – lfalin Nov 14 '08 at 03:59