
I've been reading up on XPath and trying to find my way around the Html Agility Pack, but this has got me stumped.

HTML Snippet:

<table class="full_width list" cellspacing="0" cellpadding="0">

            <tbody><tr>
                <td class="w10">
                    <a href="add_to_slip_wap.t?marketId=28911263.2&amp;outcomeId=159043327.2&amp;numerator=8&amp;denominator=1&amp;handicap=&amp;priceType=EP&amp;ts=1435050302005">
                            2
                    </a>


                </td>

            </tr>

That's enclosed in some other tags, but the whole thing is a pretty big lump to post and I'm not sure how much is needed. I'm not able to embed images, but I've got a screenshot of Chrome's Developer Tools showing all of the tags: https://i.stack.imgur.com/BYHnR.jpg

This table repeats itself, and I'm trying to loop through and read the contents of each cell with the w10 class. I've tried a lot of different variations, but the one which makes sense to me (but obviously doesn't work) is:

    For Each node As HtmlNode In document.DocumentNode.SelectNodes("//div/table/tbody/tr/td[@class='w10']")
        MsgBox(node.InnerText)
    Next

This throws a System.NullReferenceException. Specifically I'm looking for the anchor text (in this case 2), but variations of [@class='w10']//a don't seem to work either, so I think I'm right in assuming it has gone wrong before that point.

I looked at the markup and followed it down to that class, hoping it would be as simple as that. Apparently not. I'm assuming I don't need to start all the way at the top at //html or something, but trying to go straight for //[@class='w10'] didn't work either.

If anyone could point me in the right direction I'd appreciate it. A lot of the example code I'm finding is for single nodes, and they're usually sitting right out in a //div[@class='classname']. Once the target is buried in nested tags I lose the ability to find it.

/Edit:

The big obvious thing I was missing is that the xmlns says XHTML, which means everything is now in a different namespace. If I figure out what I'm doing I'll update in case anyone is looking for the same kind of thing in future.
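
A note on the exception itself: in the Html Agility Pack, `SelectNodes` returns `Nothing` rather than an empty collection when the XPath matches no nodes, so the `For Each` above fails before its body ever runs. A minimal sketch of a guarded version follows. `W10Scraper` and `GetW10Texts` are hypothetical names, not part of the library, and the XPath assumes the anchor is a direct child of the `w10` cell, as in the snippet.

```vb
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module W10Scraper
    ' Hypothetical helper: returns the trimmed anchor text of every
    ' <td class="w10"> cell, or an empty list when nothing matches.
    Public Function GetW10Texts(html As String) As List(Of String)
        Dim document As New HtmlDocument()
        document.LoadHtml(html)

        Dim results As New List(Of String)

        ' SelectNodes yields Nothing (not an empty collection) on no match,
        ' so check before looping to avoid a NullReferenceException.
        Dim nodes As HtmlNodeCollection =
            document.DocumentNode.SelectNodes("//td[@class='w10']/a")
        If nodes Is Nothing Then Return results

        For Each node As HtmlNode In nodes
            results.Add(node.InnerText.Trim())
        Next
        Return results
    End Function
End Module
```

An empty result from this function means the XPath matched nothing in the markup that was actually parsed, which is a different question from whether the XPath is syntactically valid. Note also that Chrome's Developer Tools show the parsed DOM, which can contain `<tbody>` elements that are absent from the raw source, so a path that spells out `/tbody/` may be stricter than necessary.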

Max Better
  • `//[@class='w10']` is not syntactically valid XPath. Try `//td[@class='w10']`. – Tomalak Jun 23 '15 at 10:59
  • That was the first thing I'd tried thinking (hoping) it would have been as simple as that. Gave it another bash but System.NullReferenceException again. Thanks though. – Max Better Jun 23 '15 at 11:02
  • Could it be you are working with XHTML? – Tomalak Jun 23 '15 at 11:31
  • So I'm assuming I've done something really dumb here. xmlns="http://www.w3.org/1999/xhtml". Right so I'm assuming the xpath is going to be completely different to reading normal html. Sorry I should have known. – Max Better Jun 23 '15 at 11:42
  • You have to register the namespace and use it in your XPath. Never ignore `xmlns`, it's important. – Tomalak Jun 23 '15 at 11:43
  • I thought I'd cheat a bit and try and use a wildcard //*[@class="w10"] but apparently that's not going to fly. If I'm understanding this right it becomes something similar to /x:html/x:body/x:div for the path even though the source I'm seeing on Chrome is just `<html><body><div>`? – Max Better Jun 23 '15 at 12:07
  • Hm, it appears that the Agility Pack ignores namespaces (see http://stackoverflow.com/questions/7772872/cant-figure-how-to-parse-using-html-agility-pack). What does `//body` give you? – Tomalak Jun 23 '15 at 12:16
  • That would be handy. It seems to dump the data from //body alright: http://imgur.com/NsxjDpj – Max Better Jun 23 '15 at 12:27
  • So when `//body` works and `//td` works (I assume) then `//td[@class='w10']` ought to work, too. Try `//td[contains(@class, 'w10')]` (or, to prevent false positives, `//td[contains(concat(' ', @class, ' '), ' w10 ')]`). – Tomalak Jun 23 '15 at 12:31
  • //td is a System.NullReferenceException as well. So body works and td doesn't so the path is failing somewhere between those points. That still brings me back to square one as the original //div/table/tbody/tr/td didn't work either. :/ – Max Better Jun 23 '15 at 13:53
  • So I would assume that the table is either loaded dynamically by JavaScript or that it sits in an iframe. – Tomalak Jun 23 '15 at 13:59
  • I can't see an iFrame or Script tag. I've been able to scrape the data through a browser DOM reliably so it's always there in the same place. I'm probably missing something really, really obvious on the code or my xpath syntax. Actually from what I've seen of xpath so far I'm going to go with 'relatively' obvious. – Max Better Jun 23 '15 at 14:09
  • Well if `//td` comes up empty then there's no `<td>` in the document, it's as simple as that. Can you share the URL? – Tomalak Jun 23 '15 at 14:13
  • https://mobile.stanjames.com/event_wap.t?eventId=4637482.2&marketId=28911449.2&ts=1435068897033 The bit I'm trying to find is the selection number on the left-hand side. I'm probably missing something really obvious here. – Max Better Jun 23 '15 at 14:16
  • There is not a single `<td>` in that document. I see a "Place Bet" button and I get a message to log in when I press it. I suppose when I'm logged in the document has additional content. But your HTML agility pack *isn't* logged in. ;) General tip: Always look at source code from an incognito mode browser window in such cases. – Tomalak Jun 23 '15 at 14:23
  • Strange because it doesn't need to be logged in and displays the data just fine if I scrape it through the browser DOM without logging in by just going direct to that link. Maybe it's handling the socket request differently but that wouldn't explain why it wouldn't display for you... And I think I was dumping divs or something and the odds were showing there. I'll have to take a look and see what's going on. Thanks for trying. – Max Better Jun 23 '15 at 14:35
  • You will probably have a cookie set that I don't have. As I said, try in a private/incognito browser window. – Tomalak Jun 23 '15 at 14:37
  • First time loaded blank without the odds. Clicked again and incognito loads with the data now as well. I'd spotted something like that happening before and assumed it was some server lag but maybe it needs to load something the first time. I guess the html agility site load wouldn't take whatever it needs into account... oh dear. – Max Better Jun 23 '15 at 14:48
  • Maybe try a [headless browser](http://stackoverflow.com/questions/10161413/headless-browser-for-c-sharp-net) instead of the Agility Pack? That would give you the option of state creation and tracking (i.e. clicking buttons, filling in forms, keeping cookies) – Tomalak Jun 23 '15 at 14:53
  • The hope was that using a socket instead of a browser control would decrease load times and make it easier to multi-thread. That said, the hope was also that this would be a lot easier to implement, so hey. Thanks for taking the time, I do appreciate it. At least I know that the XPath syntax should be working; I just need to find a workaround to get the page to load properly. – Max Better Jun 23 '15 at 15:10
  • Well, you can, after you figure out what makes the server send the proper HTML. :) – Tomalak Jun 23 '15 at 15:45
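
The diagnosis in the comments (`//body` matches, `//td` does not, so the cells are missing from the markup the Agility Pack received) can be checked in code. The sketch below is one way to do that, assuming the downloaded HTML is available as a string. `FetchDiagnostics`, `CountMatches`, and `Describe` are hypothetical names introduced for illustration; the class test is the token-safe XPath suggested by Tomalak above.

```vb
Imports HtmlAgilityPack

Module FetchDiagnostics
    ' Token-safe class test from the comments: matches "w10" as a whole
    ' class name, so class="w10 odd" matches but class="w100" does not.
    Public Const W10CellXPath As String = "//td[contains(concat(' ', @class, ' '), ' w10 ')]"

    ' Counts the nodes an XPath selects, treating "no match" (Nothing) as 0.
    Public Function CountMatches(document As HtmlDocument, xpath As String) As Integer
        Dim nodes As HtmlNodeCollection = document.DocumentNode.SelectNodes(xpath)
        If nodes Is Nothing Then Return 0
        Return nodes.Count
    End Function

    ' Hypothetical helper: summarises what the parsed markup contains, so a
    ' missing table can be told apart from a wrong XPath.
    Public Function Describe(html As String) As String
        Dim document As New HtmlDocument()
        document.LoadHtml(html)
        Return String.Format("body={0}, td={1}, w10={2}",
                             CountMatches(document, "//body"),
                             CountMatches(document, "//td"),
                             CountMatches(document, W10CellXPath))
    End Function
End Module
```

If `Describe` reports a body but zero `td` elements for the HTML the scraper downloaded, the server sent a different page than the browser shows, and no change to the XPath will help. The string passed in should come from the scraper's own request, not from a copy saved out of the browser.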

0 Answers