5

How can I remove the entire head tag from an HTML file using NSRegularExpression? Can someone give me a regex?

Thanks in advance, Ph99Ph

3 Answers

17

There is none! HTML is a context-free (Chomsky type-2) language, and a regular expression can only recognize regular (type-3) languages, so HTML is not parsable in general with a regex.

See this wiki article in case of doubt.

Lots of people use regex for parsing/editing HTML. This works quite well in simple cases but is utterly error prone.

This being said: You should have fairly reliable results with this regex:

<head>.+?</head>

This requires "." to also match line breaks. If it doesn't, then use this:

<head>(?:.|\n|\r)+?</head>

Again: This is error prone, don't do it.
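For reference, here is a minimal Swift sketch of how the pattern above might be applied with NSRegularExpression. The function name `removingHead(from:)` is hypothetical, and the sketch assumes the document contains a plain `<head>` tag with no attributes, exactly as the pattern requires. The `.dotMatchesLineSeparators` option lets "." match line breaks, so the first pattern can be used unchanged.

```swift
import Foundation

// Hypothetical helper: removes every "<head>…</head>" span from `html`.
// Assumes a bare "<head>" opening tag (no attributes, exact case).
func removingHead(from html: String) -> String {
    // Pattern taken from the answer; "+?" is lazy, so matching stops
    // at the first "</head>" instead of the last one.
    let pattern = "<head>.+?</head>"
    // Let "." match line breaks, since a head element usually spans several lines.
    guard let regex = try? NSRegularExpression(pattern: pattern,
                                               options: [.dotMatchesLineSeparators]) else {
        return html // The pattern failed to compile; return the input unchanged.
    }
    // NSRegularExpression works on NSRange (UTF-16 offsets), not Swift string indices.
    let fullRange = NSRange(html.startIndex..., in: html)
    return regex.stringByReplacingMatches(in: html,
                                          options: [],
                                          range: fullRange,
                                          withTemplate: "")
}

let sample = "<html><head>\n<title>t</title>\n</head><body>hi</body></html>"
print(removingHead(from: sample)) // prints "<html><body>hi</body></html>"
```

The same caveats apply: a `<head>` tag with attributes, different casing, or the text "</head>" inside a comment or script will defeat this.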

What you should use is an XML parser such as NSXMLParser.

Tim S. Van Haren
Regexident
  • To be fair, this is a common misconception, endorsed in particular by the vast number of ill-informed online articles on parsing/validating/editing HTML with regex. I can only wholeheartedly recommend reading the wiki article that I linked to in my answer. Understanding language complexity is a big deal; it is basically omnipresent in computer science and programming. Well worth the read. – Regexident Apr 07 '11 at 19:30
  • To also match line breaks a modifier can be used: /.*<\/head>/s – Felix Eve Apr 03 '13 at 09:18
  • @FelixEve: `NSRegularExpression` isn't like PHP/Perl/… where regexes are written as `/pattern/flags` or `/pattern/template/flags`. Instead you have to pass it the `NSRegularExpressionDotMatchesLineSeparators` bitmask option. – Regexident Apr 03 '13 at 09:39
  • One comment: shouldn't the slash in the closing tag be escaped? `(?:.|\n|\r)+?<\/head>` – aUXcoder Aug 11 '16 at 14:38
  • @aUXcoder: That depends on whether the programming language you're using uses `/…/` literals for regex (in which case you'd be right of course). – Regexident Aug 11 '16 at 14:40
3

Please see the accepted answer at RegEx match open tags except XHTML self-contained tags. Or any version of this exact same question posted each day since the beginning of Stack Overflow.

In short, you cannot reliably parse HTML with Regular Expressions. RegEx is simply not advanced enough because of the complexities of HTML.

Community
Devin Burke
0

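Use something like this (C#, with .NET's Regex class rather than NSRegularExpression):

// Normalize the opening tag, e.g. "< head id='x'>" becomes "<head>".
// Requiring whitespace before any attributes avoids matching "<header>".
result = System.Text.RegularExpressions.Regex.Replace(result,
         @"<( )*head(\s[^>]*)?>", "<head>",
         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Normalize the closing tag, e.g. "< / head >" becomes "</head>".
result = System.Text.RegularExpressions.Regex.Replace(result,
         @"(<( )*(/)( )*head( )*>)", "</head>",
         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove the whole element. Singleline lets "." match line breaks,
// and the lazy ".*?" stops at the first "</head>".
result = System.Text.RegularExpressions.Regex.Replace(result,
         "(<head>).*?(</head>)", " ",
         System.Text.RegularExpressions.RegexOptions.IgnoreCase |
         System.Text.RegularExpressions.RegexOptions.Singleline);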
Fahim Parkar
hamed