
This is nested about 10 functions deep, so I'll just paste the relevant bits:

This line is really slow:

var nodes = Filter_Chunk(Traverse(), chunks.First());

Specifically, this chunk inside Filter_Chunk (pun not intended):

private static IEnumerable<HtmlNode> Filter_Chunk(IEnumerable<HtmlNode> nodes, string selectorChunk)
{
    // ...
    string tagName = selectorChunk;
    foreach (var node in nodes)
        if (node.Name == tagName)
            yield return node;
}

There's nothing too complicated in there, so I'm thinking it must be the sheer number of nodes returned by Traverse(), right?

public IEnumerable<HtmlNode> Traverse()
{
    foreach (var node in _context)
    {
        yield return node;
        foreach (var child in Children().Traverse())
            yield return child;
    }
}

public SharpQuery Children()
{
    return new SharpQuery(_context.SelectMany(n => n.ChildNodes).Where(n => n.NodeType == HtmlNodeType.Element), this);
}

I tried finding <h3> nodes on stackoverflow.com. There shouldn't be more than a couple thousand nodes, should there? Why is this taking many minutes to complete?


Actually, there's definitely a bug in here somewhere that is causing it to return more nodes than it should. I forked the question to address that issue.

mpen
  • possible duplicate of [C# Performance of nested yield in a tree](http://stackoverflow.com/questions/1043050/c-performance-of-nested-yield-in-a-tree) – Jon Skeet Nov 09 '10 at 20:49
  • I can't give you any sort of specific answer, but I can point you to an interesting article on Joelonsoftware.com. Down near the bottom, Joel talks about the performance hit of using XML for large data sets. http://www.joelonsoftware.com/articles/fog0000000319.html – Jesse McCulloch Nov 09 '10 at 20:51
  • Just a guess: try to use a List instead of IEnumerable / yield and tell us if this helps. The reason for your problem might be the overhead of the state machine the compiler internally builds for iterators when using yield. – Doc Brown Nov 09 '10 at 21:01
  • @Jon/Doc: You're both wrong. That might improve performance a bit (and I appreciate the suggestions... I'll implement it once I find the bug) -- but there actually *is* a bug in there somewhere. It's traversing the same nodes more than once. – mpen Nov 09 '10 at 21:02

1 Answer

public IEnumerable<HtmlNode> Traverse()
{
    foreach (var node in _context)
    {
        yield return node;
        foreach (var child in Children().Traverse())
            yield return child;
    }
}

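This code looks strange to me. Children() does not depend on the current node; it is computed from the whole _context. So it makes no sense to run over those children once for each node in _context: every pass through the loop walks the same subtrees again.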
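To make the effect concrete, here is a minimal sketch that is not taken from the question or from HtmlAgilityPack. `Node`, `TraversalSketch`, `TraverseBuggy` and `TraverseFixed` are hypothetical names introduced only for illustration. The first method mirrors the shape of the Traverse() above, and the second shows one possible fix, assuming the intent is a plain depth-first walk that visits each node once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for HtmlNode; NOT the HtmlAgilityPack type.
public sealed class Node
{
    public string Name { get; }
    public List<Node> ChildNodes { get; } = new List<Node>();

    public Node(string name, params Node[] children)
    {
        Name = name;
        ChildNodes.AddRange(children);
    }
}

public static class TraversalSketch
{
    // Same shape as the Traverse() in the question: for EACH node in the
    // context, it walks the children of the WHOLE context, so the same
    // subtrees are yielded once per sibling.
    public static IEnumerable<Node> TraverseBuggy(IReadOnlyList<Node> context)
    {
        foreach (var node in context)
        {
            yield return node;
            var allChildren = context.SelectMany(n => n.ChildNodes).ToList();
            foreach (var child in TraverseBuggy(allChildren))
                yield return child;
        }
    }

    // One possible fix (an assumption about the intended behavior):
    // descend only into the current node's own children.
    public static IEnumerable<Node> TraverseFixed(IEnumerable<Node> context)
    {
        foreach (var node in context)
        {
            yield return node;
            foreach (var descendant in TraverseFixed(node.ChildNodes))
                yield return descendant;
        }
    }
}
```

For a root with three children that each have two children (10 nodes in total), TraverseFixed yields 10 results while TraverseBuggy yields 22. The repetition multiplies at every level with more than one sibling, which would be consistent with the multi-minute run time on a real page.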

CodesInChaos
  • You're right. I was trying to re-use a function I already had. Will accept to close question :) – mpen Nov 09 '10 at 22:49