
I have a string like this

String text =

<p><span><span id="test">Meanwhile, the Cougars are coming off of a win against Eastern 
Washington University in which they scored 88 points and had three players score at least 
15 points. <span>Motum</span> recorded his fourth career double-double in the game as well.  
</span></span></p> 

<p><span>After Dexter Kernich-Drew, Royce Woolridge, and Will DiIorio were unable to 
practice last Wednesday before the game against EWU, the team is healthy and ready to play 
against Utah Valley. </span></p>


<p><span><span><span>Woolridge</span>, a <span>redshirt</span> sophomore transfer who has 
started at guard in the first two games this season, scored seven points and had two assists 
against EWU. He also had 10 points and three assists against Saint Martin&rsquo;s. </span> 
</span></p>

I need to get rid of all `<span>`s that have no attributes and are just wrapping content. The pattern I have so far is

text = Regex.Replace(text, @"</?span([^>]*|/)?>", "", RegexOptions.Compiled);

which just pulls all spans out leaving

<p>Meanwhile, the Cougars are coming off of a win against Eastern Washington University 
in which they scored 88 points and had three players score at least 15 points. Motum 
recorded his fourth career double-double in the game as well. </p> 

<p>After Dexter Kernich-Drew, Royce Woolridge, and Will DiIorio were unable to practice 
last Wednesday before the game against EWU, the team is healthy and ready to play 
against Utah Valley. </p> 

<p>Woolridge, a redshirt sophomore transfer who has started at guard in the first 
two games this season, scored seven points and had two assists against EWU. He also had 
10 points and three assists against Saint Martin&rsquo;s. </p>

That is close, but I needed the first `<p>`, the one that had `<span id="test">` in it, to look like

<p><span id="test">Meanwhile, the Cougars are coming off of a win against Eastern 
Washington University in which they scored 88 points and had three players score at 
least 15 points. Motum recorded his fourth career double-double in the game as well. 
</span></p>

The question here is how to find nested spans that don't have attributes and remove them, while keeping spans that do have attributes. I tried a few other patterns that use backreferences for the end tag, but this one has come the closest.

Quantum
  • A good read - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. You should consider using an html parser. – Mike Park Nov 15 '12 at 20:36
  • It sounds like it's not working well because regex doesn't fit this problem well. http://stackoverflow.com/a/1732454/2009. Consider [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/)? – hometoast Nov 15 '12 at 20:36
  • OK, fair enough. Then what is a lightweight way to approach this in C#? – Quantum Nov 15 '12 at 20:43
  • @jeremyBass_DC using a HTML Parser. Certainly not with regex – Cristian Lupascu Nov 15 '12 at 20:44
  • @Vadim that may lead to trouble as well. I think some HTML standards are not fully XML-compliant. For example, some tags can have no matching closing tag. – Cristian Lupascu Nov 15 '12 at 20:47
  • But an empty XML node is legal, right? So if it is valid XHTML it would work, yes? – Quantum Nov 15 '12 at 20:51
  • @Vadim You cannot parse HTML with an XML parser (HTML != XHTML). – L.B Nov 15 '12 at 20:53

2 Answers

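// Requires the HtmlAgilityPack package and: using System.Linq;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html holds the markup from the question

// Select every <span> that has at least one attribute and collect its text.
// Note: SelectNodes returns null when nothing matches.
var spans = doc.DocumentNode.SelectNodes("//span[@*]")
                .Select(s => s.InnerText)
                .ToList();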
L.B

Here's some pseudocode for a simple algorithm:

create a stack of booleans

set the last position to the start of the text

search for the opening and the closing spans and for each one found:
    append the text since the last position up to the start of the found item to the output

    if the found item is an opening span:
        if the found item has attributes:
            // it's an opening span with attributes
            // we want to keep it
            push true onto the stack
            append the item to the output
        else:
            // it's an opening span without attributes
            // we want to drop it
            push false onto the stack
    else:
        pop the top boolean from the stack
        if the popped boolean is true:
            // the corresponding opening span had attributes
            // we want to keep this closing span
            append the found item to the output

    set the last position to the end of the found item

append the remaining text since the last position to the output
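
The pseudocode above could be rendered in C# roughly as follows. This is only a sketch: the class and method names (`SpanCleaner`, `StripBareSpans`) and the tag-matching regex are assumptions, not part of the original answer. It also assumes that no `>` appears inside attribute values and that the input contains no self-closing `<span/>` tags. An unmatched closing `</span>` is simply dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class SpanCleaner
{
    // Matches an opening <span ...> (group "attrs" holds whatever follows the
    // tag name) or a closing </span>. Assumes no '>' inside attribute values.
    private static readonly Regex SpanTag = new Regex(
        @"<span\b(?<attrs>[^>]*)>|</span\s*>",
        RegexOptions.IgnoreCase);

    public static string StripBareSpans(string text)
    {
        var keep = new Stack<bool>();   // one entry per currently open span
        var output = new StringBuilder();
        int last = 0;                   // end of the previously handled tag

        foreach (Match m in SpanTag.Matches(text))
        {
            // Copy the text between the previous tag and this one.
            output.Append(text, last, m.Index - last);

            if (m.Value.StartsWith("</", StringComparison.Ordinal))
            {
                // Closing span: keep it only if its opening span was kept.
                // A closing tag with no open span (empty stack) is dropped.
                if (keep.Count > 0 && keep.Pop())
                    output.Append(m.Value);
            }
            else
            {
                // Opening span: keep it only when it carries attributes.
                bool hasAttributes = m.Groups["attrs"].Value.Trim().Length > 0;
                keep.Push(hasAttributes);
                if (hasAttributes)
                    output.Append(m.Value);
            }

            last = m.Index + m.Length;
        }

        // Copy whatever follows the final tag.
        output.Append(text, last, text.Length - last);
        return output.ToString();
    }
}
```

With the string from the question, this would be called as `text = SpanCleaner.StripBareSpans(text);`. For a small input such as `<p><span><span id="test">A <span>B</span> C</span></span></p>`, it returns `<p><span id="test">A B C</span></p>`. The stack is what lets each closing tag be paired with its own opening tag, which a single regex replacement cannot do.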
MRAB