-1

I'm trying to strip the tags from a string containing HTML, using a regular expression:

NSString* regex = @"<.*?>";
NSString* html = @"<span class=\"test\">Test1</span><span class=\"test\">Test2</span><span class=\"test\">Test3</span><span class=\"test\">Test4</span>";

html = [html stringByReplacingOccurrencesOfString:regex withString:@""];

I want to remove the span tags and keep only the text between them.

Any ideas?

NthDegree
  • 1
    Uh-oh, regex parsing of HTML again... please see [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – dreamlax Jan 09 '12 at 21:55
  • 1
    What happens when you use the code above? What's the output? What's the desired output? – PengOne Jan 09 '12 at 21:55
  • @PengOne: The `stringByReplacingOccurrencesOfString:withString:` method only replaces literal instances of the string, i.e. each time it encounters `"<.*?>"` it will replace it with `@""`, but since `"<.*?>"` doesn't appear in the `html` string, `html` will remain the same. – dreamlax Jan 09 '12 at 22:01
  • @dreamlax: I don't think this counts as "parsing", since NthDegree isn't trying to extract semantic information. Barring malformed HTML (`<` or `>` inside another tag), it should be possible to construct a regex that just removes tags themselves. – jscs Jan 09 '12 at 22:07
  • @JoshCaswell: yeah this isn't as bad, but as long as OP knows that if it were to get trickier then they need to consider a different way of parsing. – dreamlax Jan 09 '12 at 22:10
  • 1
    @dreamlax the question was somewhat rhetorical with the goal of having the OP improve the quality of his question. – PengOne Jan 09 '12 at 23:46
  • Bottom line: doing text munging of HTML using anything that assumes tag syntax or structure is either a one-off hack or doomed to failure. – bbum Jan 10 '12 at 03:00

3 Answers

5

You could probably do something like this with the `stringByReplacingMatchesInString:options:range:withTemplate:` method of `NSRegularExpression`:

NSRegularExpression *re = [NSRegularExpression regularExpressionWithPattern:@"<.*?>"
                                                                    options:0
                                                                      error:NULL];

NSString *result = [re stringByReplacingMatchesInString:html
                                                options:0
                                                  range:NSMakeRange(0, [html length])
                                           withTemplate:@""];

Check the `NSRegularExpression` documentation for any options you may need.
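For comparison, here is a minimal sketch of a one-call alternative. It assumes you are happy with the same `<.*?>` pattern and simply passes `NSRegularExpressionSearch` to the four-argument `NSString` method, so the search string is treated as a pattern instead of a literal. The helper names `StripTagsWithRegexSearch` and `StripTagsLiterally` are made up for illustration; the second one repeats the call from the question to show why it has no effect.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the question): removes anything matching
// <...> by asking NSString to treat the search string as a regular expression.
static NSString *StripTagsWithRegexSearch(NSString *html) {
    return [html stringByReplacingOccurrencesOfString:@"<.*?>"
                                           withString:@""
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, html.length)];
}

// Hypothetical helper: the two-argument call from the question. It searches
// for the literal characters "<.*?>", so ordinary HTML comes back unchanged.
static NSString *StripTagsLiterally(NSString *html) {
    return [html stringByReplacingOccurrencesOfString:@"<.*?>" withString:@""];
}
```

Note that `.` does not match line breaks by default, so a tag that spans several lines would be left in place. If that matters for your input, `NSRegularExpressionDotMatchesLineSeparators` is probably the option to look at when using `NSRegularExpression` directly.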

dreamlax
1

This just removes < and > characters and everything between them, which I suppose is sufficient:

- (NSString *)stripTags:(NSString *)str
{
    NSMutableString *ms = [NSMutableString stringWithCapacity:[str length]];

    NSScanner *scanner = [NSScanner scannerWithString:str];
    // Don't skip whitespace; it is part of the text we want to keep.
    [scanner setCharactersToBeSkipped:nil];
    NSString *s = nil;
    while (![scanner isAtEnd])
    {
        // Copy everything up to the next "<".
        [scanner scanUpToString:@"<" intoString:&s];
        if (s != nil)
            [ms appendString:s];
        // Discard the tag body, then step over the closing ">".
        [scanner scanUpToString:@">" intoString:NULL];
        if (![scanner isAtEnd])
            [scanner setScanLocation:[scanner scanLocation]+1];
        s = nil;
    }

    return ms;
}
Taryn
shrestha2lt8
1

If your input is HTML, then use an HTML parser.

"Parsing" HTML with a regular expression is an exercise in futility. Note that there are plenty of questions on SO that describe HTML parsing on iOS/OS X.

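As a rough illustration of the parser route, here is a sketch that uses `NSXMLDocument` with the `NSXMLDocumentTidyHTML` option and then reads back the text content. This is only one possible approach and not necessarily what the linked questions recommend. It assumes OS X (`NSXMLDocument` is not available on iOS) and ARC, and `TextContentOfHTML` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text content of an HTML string, or nil
// if the input could not be parsed. OS X only, since NSXMLDocument is not
// part of Foundation on iOS.
static NSString *TextContentOfHTML(NSString *html) {
    NSError *error = nil;
    // NSXMLDocumentTidyHTML asks the parser to clean up the HTML into
    // well-formed markup before building the tree.
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
                                                          options:NSXMLDocumentTidyHTML
                                                            error:&error];
    if (doc == nil) {
        NSLog(@"Could not parse HTML: %@", error);
        return nil;
    }
    // stringValue concatenates the text of all descendant nodes.
    return [[doc rootElement] stringValue];
}
```

The returned string may contain extra whitespace introduced by the tidy step, so you may want to trim it before use.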
bbum