Having some trouble with this one.

Trying to do some basic syntax highlighting for a custom file. Need to know if an element is inside a tag.

Some sample data

<span class="class1"> 
    Some Text <span class="class2">Some More Text</span>
    TEST
    <span>Text</span>
</span>
TEST

What I want to do here is find the occurrences of TEST that are not nested in a span tag.

The first one should not match, as it is nested inside class1; the second one should match, because it isn't nested in any span tags.

In other words, a check on the first TEST should show it's nested in a span tag, and a check on the second should show it's not.

I know regex is not meant to be used to parse HTML, but for my little situation I thought regex would be easiest, as I don't know another way to do what I'm looking for. I'm not against using XPath if it can solve this problem quickly.

In my code, all I want is a method like this:

bool InsideSpanTag(string source, int index);

This would return true if index is in between some span tags in the string source, and false if it's not.

EDIT: Never mind, I'll just count the opening and closing span tags to the left of the index and see if the number of opening span tags is greater than the number of closing tags. Kinda quick and dirty, but it's really all I needed.
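
For anyone landing here later, a minimal sketch of that counting approach might look like the following. The class name `SpanCheck` and the two regex patterns are illustrative assumptions, not code from the original post. It is deliberately quick and dirty: it assumes reasonably well-formed markup and will miscount self-closing `<span/>` tags, span tags inside comments, and attribute values that contain `>`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanCheck
{
    // Opening tags such as <span> or <span class="x">. The \b keeps
    // longer names like <spanner> from matching. (Assumed pattern.)
    private static readonly Regex OpenTag =
        new Regex(@"<span\b[^>]*>", RegexOptions.IgnoreCase);

    // Closing tags such as </span> or </span >. (Assumed pattern.)
    private static readonly Regex CloseTag =
        new Regex(@"</span\s*>", RegexOptions.IgnoreCase);

    // Returns true when more span tags were opened than closed
    // in the part of source that lies to the left of index.
    public static bool InsideSpanTag(string source, int index)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));
        if (index < 0 || index > source.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        string left = source.Substring(0, index);
        int opened = OpenTag.Matches(left).Count;
        int closed = CloseTag.Matches(left).Count;
        return opened > closed;
    }
}
```

With the sample data above, calling `SpanCheck.InsideSpanTag(source, source.IndexOf("TEST"))` sees two opening tags and one closing tag to the left and returns true, while `source.LastIndexOf("TEST")` sees three of each and returns false.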

Kyle Gobel
  • 10
    You shouldn't be using regular expressions to parse HTML. – p.s.w.g Jun 27 '13 at 16:57
  • You need to be more clear... what should show as nested? The text? The class2 span element? What are you trying to match? p.s.w.g is also correct... can you use XPath? – Derek Jun 27 '13 at 16:57
  • 4
    I'm just going to leave this here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Nathaniel Ford Jun 27 '13 at 17:06
  • 1
    Use an XML parser. That's what they're for. –  Jun 27 '13 at 17:14
  • 2
    The trouble with using an XML parser is that perfectly valid HTML is not necessarily a regular expression that can be consumed by a general lexical parser. Literally; this question has been asked many, many times and the answer is: this way lies madness. With sufficient constraint (did I run into a span tag? stop!) you can probably do this particular thing, but it's *very hard* to solve generally. And regex is usually a general solution. – Nathaniel Ford Jun 27 '13 at 17:17
  • @Kyle: What programming language are you using, in what environment? Do you have a HTML markup string, or is it a DOM already? – Bergi Jun 27 '13 at 17:22
  • @Bergi this is C#, it's just an html markup string – Kyle Gobel Jun 27 '13 at 17:23
  • Check this post: http://stackoverflow.com/a/6063226/426422 – Mike Cheel Jun 27 '13 at 17:31
  • possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – Fabian Bigler Jun 27 '13 at 17:34
  • @JackManey xml parser is used for parsing xml not html..even if it's `xhtml` still you won't be able to parse it using xml parser – Anirudha Jun 27 '13 at 17:47

1 Answer

Regex is not a good choice for parsing HTML files.

HTML is neither strict nor regular in its format (XHTML excepted).

Use HtmlAgilityPack instead.

Here's your code:

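using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtmlString);

// This XPath selects every text node that has no span ancestor at any depth.
// SelectNodes returns null when nothing matches, so check for that first.
var outside = doc.DocumentNode.SelectNodes("//text()[not(ancestor::span)]");
bool valid = outside != null && outside.Any(p => p.InnerText.Contains("TEST"));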
carla
Anirudha
  • 1
    +1 for `HTMLAgilityPack`. I have used it successfully for parsing real-world HTML content in a pretty large project. – dotNET Jun 27 '13 at 17:44
  • I'm sure it is good, but isn't this sorta overkill for what I needed? I should have specified in my post more info about what I needed and how 'accurate' (for lack of a better word) and efficient it needed to be. I'll be sure to check out HtmlAgilityPack for HTML parsing in the future. – Kyle Gobel Jun 27 '13 at 17:49