I have a string containing something like this:

string text = "<p>test <span> <font> here </font> </span> try</p><p> <font> try 2</font> </p>";

What I need is to filter it like this:

  • Keep text inside P
  • Remove Span and its content (font and text)
  • Keep text inside font if its direct parent is not a Span

What I have is:

StringBuilder sbtexttoCorrect = new StringBuilder();
HtmlDocument html = new HtmlDocument();
html.LoadHtml(textToFormat);
var nodes = html.DocumentNode.SelectNodes("//p");

foreach (var line in nodes)
{
   if (line.Name == "SPAN")
   {
      line.RemoveAllChildren();
      line.Remove();
   }
}
foreach (var txt in nodes)
{
   sbtexttoCorrect.Append(txt.InnerText);
}

But at the end, sbtexttoCorrect still contains the text of the font that is a child of the span, even with the RemoveAllChildren call and the Remove on the node itself.

What am I missing?

Note: in another post someone suggested:

foreach (var line in nodes
    .Select(node => node.ChildNodes.Where(childNode => childNode.Name != "span"))
    .Select(textNodes => textNodes.Aggregate(String.Empty, (current, node) => current + node.InnerText)))
{
    sbtexttoCorrect.Append(line);
}

But I do not understand all of the syntax, so I wanted to rewrite my own attempt. It also did not work every time: it still picks up the text inside the font inside the span.

Note 2: I can't find any documentation for the Html Agility Pack. If someone knows where to find it, I'd like to learn more about this library.

Edit: The real HTML is much more complex, with a number of child nodes that I can't know in advance. They can be TD or DIV elements. The only certainty is that when there is a span, I need to skip its content and its child nodes.

Slayner
  • I edited my answer. It now removes all spans regardless of the level at which they appear in the HTML – Fabian Jan 11 '16 at 15:17

1 Answer

I see these problems in your code:

  • You compare the node name against the upper-case "SPAN", whereas HtmlAgilityPack exposes element names in lower case => your if block will never hit
  • You only loop over the p elements (instead of the children of the p elements) => your if block will never hit
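
The first point can be checked with a small sketch. The class and method names below (`SpanNameDemo`, `ElementNames`) are hypothetical helpers introduced only for illustration; the HtmlAgilityPack calls used are `LoadHtml`, `Descendants`, `NodeType` and `Name`.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class SpanNameDemo
{
    // Hypothetical helper (not part of HtmlAgilityPack): lists the Name
    // that the parser reports for every element in the given markup.
    public static string[] ElementNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
                  .Where(n => n.NodeType == HtmlNodeType.Element)
                  .Select(n => n.Name)
                  .ToArray();
    }
}
```

Calling `SpanNameDemo.ElementNames("<p>a <SPAN>b</SPAN></p>")` should report the span as "span" even though the markup spells it in upper case, which is why a comparison against "SPAN" does not match.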

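Based on your additional explanations, this should work:

  • It selects all spans with an XPath query (element names are matched in lower case, so it works for both <span> and <SPAN>)
  • It removes the spans, together with everything nested inside them
  • It strips the remaining HTML elements and keeps only their text (as indicated here)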

    // Requires: using System.Text; using HtmlAgilityPack;
    string text = "<p>test <SPAN> <font> here </font> </SPAN> try</p><p><table> <tr><td><span>test</span></td></tr></table><font> try 2</font> </p>";
    StringBuilder sbtexttoCorrect = new StringBuilder();
    HtmlDocument html = new HtmlDocument();
    html.LoadHtml(text);

    // SelectNodes returns null when nothing matches, so guard before iterating.
    var nodes = html.DocumentNode.SelectNodes("//span");
    if (nodes != null)
    {
        // Removing a span also drops everything nested inside it.
        foreach (var node in nodes)
        {
            node.Remove();
        }
    }

    // Collect the text of every remaining leaf node.
    foreach (var node in html.DocumentNode.DescendantsAndSelf())
    {
        if (!node.HasChildNodes)
        {
            string t = node.InnerText;
            if (!string.IsNullOrEmpty(t))
                sbtexttoCorrect.AppendLine(t);
        }
    }
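
If modifying the document is undesirable, the same filtering could be sketched without removing anything, by skipping every text node that has a span ancestor. This is an alternative to the code above, not the same technique. `SpanFilter` and `TextOutsideSpans` are hypothetical names, and the trimming of whitespace-only text is an assumption about the desired output.

```csharp
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class SpanFilter
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the text of
    // every text node that has no <span> ancestor, one entry per line.
    public static string TextOutsideSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        var textNodes = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Where(n => !n.Ancestors("span").Any());

        foreach (var node in textNodes)
        {
            // Assumption: decode entities and drop whitespace-only text.
            string t = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (t.Length > 0)
                sb.AppendLine(t);
        }
        return sb.ToString();
    }
}
```

Because the check walks all ancestors, it applies at any nesting depth, for example a span inside a td inside a table.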
    
Fabian
  • Well, the string I use might have the node written in upper case, but I tried it with lower case too. In your case, the text inside the font inside the span won't be taken because you don't loop through the child nodes? – Slayner Jan 11 '16 at 14:15
  • Yes and isn't that what you want? "Keep Text inside P Remove Span and content (font and text) Keep Text inside font if its direct parent is not a Span*" – Fabian Jan 11 '16 at 14:17
  • Yes, this is what I wanted, but does that mean that if I have a table, it won't go through the whole table and look at all the children, and will only iterate over the first two levels of child nodes? – Slayner Jan 11 '16 at 14:21
  • After trying it, it still gets the text inside the font of the span, where it should not – Slayner Jan 11 '16 at 14:26
  • The code I posted gives "test try try 2". You say you don't get the same output? – Fabian Jan 11 '16 at 14:31
  • Yes, I get: tes here test try try 2; I sometimes get the value appended twice to my StringBuilder. It's as if a first loop takes everything, even the font inside the span, and then a second loop occurs where it takes only what we said – Slayner Jan 11 '16 at 14:36
  • It must be something in your code or the way you integrated the code. Can you copy and paste only the code from my answer into a command-line project and execute it? – Fabian Jan 11 '16 at 14:41
  • Well, it is the only thing I do; I only call HtmlEntity.DeEntitize(sbtexttoCorrect.ToString()); before returning what I have in my StringBuilder. This is why I don't get it – Slayner Jan 11 '16 at 14:45
  • This issue might come from the structure of the HTML, because it's actually much more complex than my example, so I think I need more iterations than only 2. That might be why it still gets the span and its font sometimes – Slayner Jan 11 '16 at 14:47
  • Sorry. I removed my remark. Seems to work just fine. – Fabian Jan 11 '16 at 15:20
  • Dude, that is exactly it, thanks a lot. I might modify it a little, now that I understand how to go through all the elements, so that it matches some specific cases. – Slayner Jan 11 '16 at 15:21