25

I have a huge NSString with HTML text inside. The string is more than 3,500,000 characters long. How can I convert this HTML text to an NSString containing plain text? I was using NSScanner, but it is too slow. Any ideas?

Eric Aya
Igor Prusyazhnyuk
    possible duplicate of [Remove HTML Tags from an NSString on the iPhone](http://stackoverflow.com/questions/277055/remove-html-tags-from-an-nsstring-on-the-iphone) – hpique Mar 13 '14 at 14:53

7 Answers

69

It depends on which iOS version you are targeting. Since iOS 7 there is a built-in initializer that not only strips the HTML tags but also applies their formatting to the resulting attributed string (read its string property if you only need the plain text):

Xcode 9/Swift 4

if let htmlStringData = htmlString.data(using: .utf8), let attributedString = try? NSAttributedString(data: htmlStringData, options: [.documentType : NSAttributedString.DocumentType.html], documentAttributes: nil) {
    print(attributedString)
}

You can even create an extension like this:

extension String {
    var htmlToAttributedString: NSAttributedString? {
        guard let data = self.data(using: .utf8) else {
            return nil
        }

        do {
            return try NSAttributedString(data: data, options: [.documentType : NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
        } catch {
            print("Cannot convert html string to attributed string: \(error)")
            return nil
        }
    }
}

Note that this sample code uses UTF-8 encoding. You could also write a function instead of a computed property and take the encoding as a parameter.
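As a rough sketch of that suggestion, the extension can expose a function that takes the encoding as an argument. The name htmlToAttributedString(encoding:) and the default value are assumptions made for illustration, not part of any SDK. This only builds on Apple platforms, and Apple's documentation advises calling the HTML importer from the main thread.

```swift
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

extension String {
    /// Hypothetical helper: parses the receiver as HTML using the given encoding.
    /// Returns nil if the string cannot be encoded or the HTML import fails.
    func htmlToAttributedString(encoding: String.Encoding = .utf8) -> NSAttributedString? {
        guard let data = self.data(using: encoding) else {
            return nil
        }
        return try? NSAttributedString(
            data: data,
            options: [.documentType: NSAttributedString.DocumentType.html,
                      .characterEncoding: encoding.rawValue],
            documentAttributes: nil)
    }
}

// Usage: the plain text is available through the string property.
let plainText = "<p>Hello <b>world</b></p>".htmlToAttributedString(encoding: .utf8)?.string
```

A function keeps the call site explicit about the encoding, while the default argument preserves the short form for the common UTF-8 case.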

Swift 3

let attributedString = try NSAttributedString(data: htmlString.data(using: .utf8)!,
                                              options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType],
                                              documentAttributes: nil)

Objective-C

[[NSAttributedString alloc] initWithData:[htmlString dataUsingEncoding:NSUTF8StringEncoding] options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]} documentAttributes:nil error:nil];

If you just need to remove everything between < and > (a quick and dirty approach that breaks when the text itself contains these characters), use this:

- (NSString *)stringByStrippingHTML {
   NSRange r;
   NSString *s = [self copy]; // under ARC; pre-ARC code would use [[self copy] autorelease]
   while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
     s = [s stringByReplacingCharactersInRange:r withString:@""];
   return s;
}
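The loop above searches from the start of the string after every removal and builds a new string each time, which adds up on multi-megabyte input. A minimal Swift sketch of the same idea, assuming the same "<[^>]+>" pattern is good enough for your HTML, lets Foundation replace every match in one call. The function name stripTags is made up for this example.

```swift
import Foundation

/// Hypothetical helper: removes everything matching "<[^>]+>" in a single pass.
/// Same caveat as the loop version: literal < and > in the text can confuse it.
func stripTags(_ html: String) -> String {
    return html.replacingOccurrences(of: "<[^>]+>",
                                     with: "",
                                     options: .regularExpression)
}

print(stripTags("<p>Hello <b>world</b></p>")) // prints "Hello world"
```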
o15a3d4l11s2
16

I solved my problem with NSScanner, but I don't run it over the whole text at once. I process the text in chunks of 10,000 characters and then concatenate the results. My code is below:

-(NSString *)convertHTML:(NSString *)html {

    NSScanner *myScanner;
    NSString *text = nil;
    myScanner = [NSScanner scannerWithString:html];

    while ([myScanner isAtEnd] == NO) {

        [myScanner scanUpToString:@"<" intoString:NULL] ;

        [myScanner scanUpToString:@">" intoString:&text] ;

        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@""];
    }
    //
    html = [html stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

    return html;
}

Swift 4:

func htmlToString(html: String) -> String {
    var htmlStr = html
    let scanner = Scanner(string: htmlStr)
    var text: NSString? = nil
    while !scanner.isAtEnd {
        // Skip to the next tag, then capture it up to (not including) ">".
        scanner.scanUpTo("<", into: nil)
        scanner.scanUpTo(">", into: &text)
        // Remove every occurrence of the captured tag from the string.
        htmlStr = htmlStr.replacingOccurrences(of: "\(text ?? "")>", with: "")
    }
    htmlStr = htmlStr.trimmingCharacters(in: .whitespacesAndNewlines)
    return htmlStr
}
Mehul Thakkar
Igor Prusyazhnyuk
  • Add an @autoreleasepool inside the while loop to keep memory usage down – Rafael Gonçalves Jun 14 '15 at 20:43
  • Note: this will also replace anything between tags, so if you have an email address like "Some Name " it'll strip out . That's probably not what you want. It needs to possibly look up against a map of known HTML tags. – strangetimes May 10 '18 at 18:00
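For very large input, one alternative to repeated replacingOccurrences calls is to walk the string once and copy only the characters that fall outside angle brackets. The sketch below is an illustration, not the answerer's method; stripTagsSinglePass is a hypothetical name. It shares the limitation raised in the comment above: it does not know which tags are real HTML, and an unmatched "<" swallows the rest of the text.

```swift
import Foundation

/// Hypothetical single-pass stripper: copies characters outside <...> to the output.
func stripTagsSinglePass(_ html: String) -> String {
    var output = String()
    output.reserveCapacity(html.utf8.count)
    var insideTag = false
    for character in html {
        switch character {
        case "<":
            insideTag = true            // start skipping
        case ">" where insideTag:
            insideTag = false           // tag closed, resume copying
        default:
            if !insideTag {
                output.append(character)
            }
        }
    }
    return output.trimmingCharacters(in: .whitespacesAndNewlines)
}

print(stripTagsSinglePass("<p>Hi <i>there</i></p>\n")) // prints "Hi there"
```

Each character is visited once, so the work grows linearly with the input length instead of with the number of tags times the length.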
2

Objective C

+ (NSString*)textToHtml:(NSString*)htmlString
{
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&quot;" withString:@"\""];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&apos;" withString:@"'"];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&lt;" withString:@"<"];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&gt;" withString:@">"];
    // Decode &amp; last so that input such as "&amp;lt;" becomes "&lt;" rather than "<".
    htmlString = [htmlString stringByReplacingOccurrencesOfString:@"&amp;" withString:@"&"];
    return htmlString;
}

Hope this helps!
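A Swift sketch of the same five replacements follows, with one detail made explicit: "&amp;" is decoded last, so that an already-escaped entity such as "&amp;lt;" ends up as "&lt;" instead of being decoded twice. The function name decodeBasicEntities is an assumption for illustration, and only these five entities are handled.

```swift
import Foundation

/// Hypothetical helper: decodes the five basic HTML entities.
func decodeBasicEntities(_ text: String) -> String {
    // "&amp;" is deliberately last to avoid double decoding.
    let replacements: [(entity: String, character: String)] = [
        ("&quot;", "\""),
        ("&apos;", "'"),
        ("&lt;", "<"),
        ("&gt;", ">"),
        ("&amp;", "&")
    ]
    var result = text
    for replacement in replacements {
        result = result.replacingOccurrences(of: replacement.entity,
                                             with: replacement.character)
    }
    return result
}

print(decodeBasicEntities("1 &lt; 2 &amp;&amp; &quot;x&quot;")) // prints 1 < 2 && "x"
```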

Dharmesh Mansata
1

For Swift:

NSAttributedString(data:(htmlString as! String).dataUsingEncoding(NSUTF8StringEncoding, allowLossyConversion: true
            )!, options:[NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: NSNumber(unsignedLong: NSUTF8StringEncoding)], documentAttributes: nil, error: nil)!
Rabindra Nath Nandi
1
- (NSString *)stringByStrippingHTML:(NSString *)inputString
{
    NSMutableString *outString = nil; // stays nil when inputString is nil

    if (inputString)
    {
        outString = [[NSMutableString alloc] initWithString:inputString];

        if ([inputString length] > 0)
        {
            NSRange r;

            while ((r = [outString rangeOfString:@"<[^>]+>|&nbsp;" options:NSRegularExpressionSearch]).location != NSNotFound)
            {
                [outString deleteCharactersInRange:r];
            }      
        }
    }

    return outString; 
}
Ahmed Abdallah
0

Swift 4:

do {
    let attributedString = try NSAttributedString(data: htmlContent.data(using: .utf8)!,
                                                  options: [.documentType: NSAttributedString.DocumentType.html],
                                                  documentAttributes: nil)
    // The plain text, without tags, is the attributed string's string property.
    let cleanString = attributedString.string
} catch {
    print("Something went wrong")
}
Josh O'Connor
0

This could be made more generic by passing the encoding as a parameter, but as an example, here is a category:

@implementation NSString (CSExtension)

    - (NSString *)htmlToText {
        return [NSAttributedString.alloc
                initWithData:[self dataUsingEncoding:NSUnicodeStringEncoding]
                     options:@{NSDocumentTypeDocumentOption: NSHTMLTextDocumentType}
          documentAttributes:nil error:nil].string;
    }

@end
Renetik
  • In this method, where are you passing the string? Maybe on self...? – Raviteja Mathangi Apr 30 '19 at 11:50
  • @Raviteja_DevObal Ah, sorry, this was a category. I could have been clearer, will edit... – Renetik May 01 '19 at 00:15
  • But I don't believe this answer is correct anymore, as there is a requirement for large HTML and this is terribly slow. I ended up using DTCoreText with some additional modifications for showing images correctly; my solution is public on GitHub though. – Renetik May 01 '19 at 00:21
  • This method is not converting dynamic HTML text from a service. I mean, I don't know which HTML content is coming from the service. But replacing with custom methods – Raviteja Mathangi May 01 '19 at 16:15
  • Sorry, that was a typo: But I don't believe this answer is NOT correct anymore, as there is a requirement for large HTML and this is terribly slow. I ended up using DTCoreText with some additional modifications for showing images correctly; my solution is public on GitHub though. – Renetik May 02 '19 at 00:53
  • I don't know what you are talking about... If you want to convert any HTML to text, this works. The downside is that it's slow, so I don't think it will work for large HTML, but maybe it will; it depends on where you are going to use it. – Renetik May 02 '19 at 00:56