4

There's a bug in Traverse() that's causing it to iterate nodes more than once.

Bugged Code

public IEnumerable<HtmlNode> Traverse()
{
    foreach (var node in _context)
    {
        yield return node;
        foreach (var child in Children().Traverse())
            yield return child;
    }
}

public SharpQuery Children()
{
    return new SharpQuery(_context.SelectMany(n => n.ChildNodes).Where(n => n.NodeType == HtmlNodeType.Element), this);
}

public SharpQuery(IEnumerable<HtmlNode> nodes, SharpQuery previous = null)
{
    if (nodes == null) throw new ArgumentNullException("nodes");
    _previous = previous;
    _context = new List<HtmlNode>(nodes);
}

Test Code

    static void Main(string[] args)
    {
        var sq = new SharpQuery(@"
<a>
    <b>
        <c/>
        <d/>
        <e/>
        <f>
            <g/>
            <h/>
            <i/>
        </f>
    </b>
</a>");
        var nodes = sq.Traverse();
        Console.WriteLine("{0} nodes: {1}", nodes.Count(), string.Join(",", nodes.Select(n => n.Name)));
        Console.ReadLine();

Output

19 nodes: #document,a,b,c,g,h,i,d,g,h,i,e,g,h,i,f,g,h,i

Expected Output

Each letter a-i printed once.

Can't seem to figure out where's it going wrong... node.ChildNodes does return just direct children, right? (from HtmlAgilityPack)


Full class (condensed) if you want to try and run it yourself.

public class SQLite
{
    private readonly List<HtmlNode> _context = new List<HtmlNode>();
    private readonly SQLite _previous = null;

    public SQLite(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        _context.Add(doc.DocumentNode);
    }

    public SQLite(IEnumerable<HtmlNode> nodes, SQLite previous = null)
    {
        if (nodes == null) throw new ArgumentNullException("nodes");
        _previous = previous;
        _context = new List<HtmlNode>(nodes);
    }

    public IEnumerable<HtmlNode> Traverse()
    {
        foreach (var node in _context)
        {
            yield return node;
            foreach (var child in Children().Traverse())
                yield return child;
        }
    }

    public SQLite Children()
    {
        return new SQLite(_context.SelectMany(n => n.ChildNodes).Where(n => n.NodeType == HtmlNodeType.Element), this);
    }
}
mpen
  • 272,448
  • 266
  • 850
  • 1,236
  • 2
    What did the debugger tell you? – Oliver Charlesworth Nov 09 '10 at 21:29
  • @Oli: About what? Where am I supposed to put a break point? It's a logic error, not a crashing error. – mpen Nov 09 '10 at 21:44
  • 2
    @Mark: Not sure I understand. Debuggers aren't just for diagnosing crashes! – Oliver Charlesworth Nov 09 '10 at 21:49
  • Can you post the whole SharpQuery class? – BFree Nov 09 '10 at 21:51
  • @BFree: It's 675 lines and less than a quarter complete >.< If you really want I'll make a condensed one. – mpen Nov 09 '10 at 21:54
  • @Oli: No... but I still have no idea where I'd put a breakpoint. The debugger told me it was getting stuck in the traverse loop. What more can it possibly tell me that I don't already know? – mpen Nov 09 '10 at 22:26
  • @Mark: Well, can you not step through and see *why* it's getting stuck? i.e. at what point does your algorithm behaviour diverge from what you were expecting? – Oliver Charlesworth Nov 09 '10 at 22:29
  • @Oli: Why do I need the debugger to tell me that? I posted the output and expected output... it very clearly diverges after "c" (next letter should be "d"). And btw, I did use the debugger to locate and isolate the problem, but I find it more helpful to actually print the output than digging through a call stack and stepping back and forth to see what the heck is going wrong. Of course, you're right that the debugger could have told me `_context` contained the wrong values... had I thought to look there. – mpen Nov 10 '10 at 02:38
  • @Mark: Your output tells you when your *output* diverged (i.e. the symptom), it doesn't tell you where the *algorithm* diverged (i.e. the cause). Really, the debugger should be your first port of call; it will help you isolate 90% of problems. Posting your code here is essentially asking others to do your job for you (although on this occasion, someone did that successfully), and I'm afraid this just comes across as laziness. Please learn to use a debugger effectively in future! – Oliver Charlesworth Nov 10 '10 at 08:53
  • @Oli: It's 5 lines of code. I know it diverged somewhere in there, how much more specific can you get? And it's not being lazy. I did try solving the problem first, and I continued to try and solve it after I posted the question, and I didn't copy-paste the answers people gave me, I learned from them and then implemented a solution my way. People like doing this kind of stuff or they wouldn't be on the site, and they wouldn't have given me any answers. If they think I'm being lazy, they don't need to answer. – mpen Nov 10 '10 at 18:26
  • @Oli: I work solo, not in a big office full of peers. Unfortunately, StackOverflow is one of the only places I can come to ask questions and get help on these kinds of things. Anyway, you've made up your mind about how ineffectively I use the debugger so... there's not much more I can say. – mpen Nov 10 '10 at 18:28
  • @Mark: Again, by stepping through line-by-line (or perhaps call-by-call) until the contents of the variables no longer match your expectations. That is the basic premise of debugging. I'd understand your point if your dataset was millions of items, or if it took hours of runtime before the bug appeared, or you had complex I/O interaction, or if your app was multithreaded; these can be horrifically hard to debug. But you have none of these! IMHO, this site is not for "please fix my code"-type questions, but for knowledge-sharing. – Oliver Charlesworth Nov 10 '10 at 18:33
  • @Oli: Bugs are a natural part of programming. So are algorithms. If you don't like to idea of me asking about a bug, then think of it as a "how do I write a tree traversal algorithm?". At least I've given a shot before I came here to ask. Original dataset *did* take many minutes to complete and the debugger was indeed useless... I had to chop it down to size. And all the data is in enumerables, which are hard to see. It will diverge on the "yield return node" statement specifically, because that's what gives the wrong output, but that's not what is causing the issue. – mpen Nov 10 '10 at 18:38
  • @Mark: My previous comments were not intended to be elitist put-downs, apologies if they came across this way. But seriously, learning how to debug is one of the most powerful skills you can acquire. – Oliver Charlesworth Nov 10 '10 at 18:38

6 Answers6

4

First, note that everything goes according to plan as long as we're iterating over elements that don't have any sibling. As soon as we hit element <c>, things start going haywire. It's also interesting that <c>, <d> and <e> all think they contain <f>'s children.

Let's take a closer look at your call to SelectMany():

_context.SelectMany(n => n.ChildNodes)

That iterates through the items in _context and accumulates the child nodes of each item. Let's have a look at _context. Everything should be okay, since it's a List of length 1. Or is it?

I suspect your SharpQuery(string) constructor stores sibling elements inside the same list. In that case, _context may not be of length 1 anymore and, remember, SelectMany() accumulates the child nodes of each item in the list.

For example, if _context is a list containing <c>, <d>, <e> and <f>, only <f> has children, and SelectMany() is called once for each element, it will accumulate and return the children of <f> four times.

I think that's your bug.

EDIT: Since you've posted the full class, I don't have to guess anymore. Look at the sequence of operations when you iterate over <b> (iterators replaced by lists for better comprehension):

  1. Call Children() on <b>, which returns <c>, <d>, <e> and <f>,
  2. Call Traverse() on the resulting list:
    1. Call Children() on <c>, but _context actually contains <c>, <d>, <e> and <f>, not only <c>, so that returns <g>, <h> and <i>,
    2. Call Children() on <d>, same thing since _context is the same for both <c> and <d> (and <e>, and <f>),
    3. Lather, rinse, repeat.
Frédéric Hamidi
  • 258,201
  • 41
  • 486
  • 479
  • I just posted the string constructor... it only adds the document node to the context, and nothing else. – mpen Nov 09 '10 at 22:04
  • Your problem is that sibling nodes share the same `_context`, and that doesn't work well with `SelectMany()`. See my updated answer. – Frédéric Hamidi Nov 09 '10 at 22:27
  • Well, that explains the bug. Thanks :) – mpen Nov 09 '10 at 22:36
  • You posted the same link twice ;) – mpen Nov 09 '10 at 22:48
  • Whoops, my bad :) Since you'll have to rewrite that, you might want to remove the yield return from your loop and to use e.g. [an item stack](http://blogs.msdn.com/b/wesdyer/archive/2007/03/23/all-about-iterators.aspx) to avoid creating too many iterators. See the accepted answer to [this question](http://stackoverflow.com/questions/1043050/c-performance-of-nested-yield-in-a-tree). – Frédéric Hamidi Nov 09 '10 at 22:50
3

Children() is broken, it selects children of siblings as well. I'd rewrite to something like this:

public IEnumerable<HtmlNode> Traverse(IEnumerable<HtmlNode> nodes)
{
    foreach (var node in nodes)
    {
        yield return node;
        foreach (var child in Traverse(ChildNodes(node)))
            yield return child;
    }
}

private IEnumerable<HtmlNode> ChildNodes(HtmlNode node)
{
    return node.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element);
}


public SharpQuery(IEnumerable<HtmlNode> nodes, SharpQuery previous = null)
{
    if (nodes == null) throw new ArgumentNullException("nodes");
    _previous = previous; // Is this necessary?
    _context = new List<HtmlNode>(nodes);
}
Robert Jeppesen
  • 7,837
  • 3
  • 35
  • 50
  • Excellent! I don't like making the functions take in arguments, but you gave me the right idea.. thank you! And yes, _previous is necessary, it's used by other functions. – mpen Nov 09 '10 at 22:44
2

Shouldn't this be enough?

public IEnumerable<HtmlNode> Traverse()
{
    foreach (var node in _context)
    {
        Console.WriteLine(node.Name);
        yield return node;
        foreach (var child in Children().Traverse())
            {} //yield return child;
    }
}
  • 1
    Nope... you're traversing the full tree but not returning any of the results beyond the first level. At any level deeper than the first one, you are yielding the correct nodes, but then not doing anything with them (they're not passed to the calling function, just to depth-1). – mpen Nov 09 '10 at 21:39
  • 2
    To be fair, the you used to have `Console.WriteLine(node.Name)` inside the IEnumerable. Now that it is outside you are correct. – Scott Chamberlain Nov 09 '10 at 21:46
  • @Scott: Well... that Console.WriteLine was just for debugging... I moved it outside to not confuse issues. But you're right.. I guess it would have produced the correct output in this case, but the point isn't to print it, it's to actually return it ;) – mpen Nov 09 '10 at 22:01
1

I can't tell exactly, but you can see the pattern is that once your stuff encounter the "/" of the "c", it start to consider the "g,h,i" to be part of every subsequent letter until it encounter the "/" of the "f".

Most likely you have an extra loop that you should get ride of.

Or for some reason, it doesnt compute correctly the "/>" parts.

Wildhorn
  • 926
  • 1
  • 11
  • 30
  • It's not the self-closing nodes. I tried expanding them too. Where's the extra loop? That's the full code... – mpen Nov 09 '10 at 21:47
1

I would do something like this and see if it fixes the issue:

public IEnumerable<HtmlNode> Traverse()
{
foreach (var node in _context)
{
    Console.WriteLine(node.Name);
    if(!node.HasChildren) {
     yield return node;
    }
    else {
    foreach (var child in Children().Traverse())
        yield return child;
    }
    }
}
}
dexter
  • 7,063
  • 9
  • 54
  • 71
0
public IEnumerable<HtmlNode> All()
{
    var queue = new Queue<HtmlNode>(_context);

    while (queue.Count > 0)
    {
        var node = queue.Dequeue();
        yield return node;
        foreach (var next in node.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
            queue.Enqueue(next);
    }
}

courtesy link

Community
  • 1
  • 1
mpen
  • 272,448
  • 266
  • 850
  • 1,236