
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I have a string that contains this HTML markup:

string markup = @"
  <html>
    <head>
      ...
    </head>
    <body>
      <input id='text1' />
      <input id='blah' />
      <input id='text1' />
    </body>
  </html>
";

How can I check for duplicate ID names?

Rod
  • you must read [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348) – CharlesB Sep 21 '12 at 14:18
  • You shouldn't use regex for this – Aurelio De Rosa Sep 21 '12 at 14:18
  • I get voted down for a problem that I need help with? I'm not trying to parse the entire markup. I am just looking for a pattern of id="sameid" that's it. Still not possible? I did read the link above. – Rod Sep 21 '12 at 14:20
  • @Rod I think some people reflexively downvote whenever they see 'HTML' and 'regex' in the same sentence. – Sean U Sep 21 '12 at 15:04
  • This question is in no way a duplicate of that other question (which exists more for comedic purpose than anything else at this point anyway). It is a legitimate question, as anybody who actually read it can see (rather than just voting to close because it mentions HTML and regex). At least link to a question that will be useful to the OP: [HTMLAgilityPack, HTML duplicate IDs](http://stackoverflow.com/q/2693350/803925) – nbrooks Sep 22 '12 at 19:39

2 Answers


You can do this with the help of HtmlAgilityPack:

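// Requires the HtmlAgilityPack package and `using System.Linq;`.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(markup);

// Group every element that has an id attribute by its id value,
// then keep only the ids that occur more than once.
var dups = doc.DocumentNode.Descendants()
    .Where(n => n.Attributes["id"] != null)
    .GroupBy(n => n.Attributes["id"].Value)
    .Select(g => new { ID = g.Key, Count = g.Count() })
    .Where(r => r.Count > 1)
    .ToList();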
L.B

A regular expression might work, but only if the HTML is very regular. If you can't be sure of the number, types, formatting, and order of the attributes on those input tags, for example, then a regex-based solution to retrieving the information you want is going to be unwieldy at best, perhaps unworkable.
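
To make the "very regular" case concrete, here is a minimal sketch of what a regex-only check could look like. `DuplicateIdSketch` and `FindDuplicateIds` are hypothetical names made up for this illustration. The sketch assumes every id value is wrapped in single or double quotes, and it will also count any `id='...'` text that happens to appear inside comments, scripts, or other attribute values, which is exactly the kind of fragility described above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class DuplicateIdSketch
{
    // Hypothetical helper, not part of any library.
    // Assumption: ids are always quoted, e.g. id='text1' or id="text1".
    public static List<string> FindDuplicateIds(string markup)
    {
        // (?<![\w-]) keeps names such as "data-id" or "uid" from matching.
        // Group 1 captures the opening quote, group 2 the id value,
        // and \1 requires the same quote character to close the value.
        var matches = Regex.Matches(
            markup,
            @"(?<![\w-])id\s*=\s*(['""])(.*?)\1",
            RegexOptions.IgnoreCase);

        // Group the captured values and keep those seen more than once.
        return matches.Cast<Match>()
            .GroupBy(m => m.Groups[2].Value)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();
    }
}
```

For the markup in the question, `DuplicateIdSketch.FindDuplicateIds(markup)` would return a list containing only `text1`.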

Better to use HTML Agility Pack. It will parse the HTML for you and spit out a tree representing the document structure. Then you can just traverse it looking for input tags and grab their ids if they have them:

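// Assumes `markup` holds the HTML string from the question.
var doc = new HtmlDocument();
doc.LoadHtml(markup);
var inputTags = doc.DocumentNode.Descendants("input");
// Attributes["id"] is null for tags without an id, so filter those out.
var ids = inputTags.Select(x => x.Attributes["id"]).Where(a => a != null).Select(a => a.Value);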
Sean U