My problem is that I can't get a div's InnerText from a table. I have successfully extracted other kinds of data, but I don't know how to read a div inside a table.

In the following picture I've highlighted the div whose InnerText I need, in this case the number 3.

Click here for first picture

I'm trying to accomplish this using the following path:

"//div[@class='kal']//table//tr[2]/td[1]/div[@class='cipars']"

But I'm getting the following error:

Click here for Error message picture

Assuming that the rest of the code is written correctly, could anyone point me in the right direction? I have been trying to figure this one out, but I can't get any results.

  • 1) `dateNode` is probably null and 2) Post code. Not pictures of your code. Thanks. :) – Arran Sep 25 '13 at 14:30
  • Sorry for the links to pictures. I thought the other code was irrelevant to solving my problem, so I posted the pictures to help explain it. The reason I'm asking for help is that I know the error says dateNode is null, but I think the path I'm using is wrong. I just don't know where the problem with it is. – Nauris Andzans Sep 25 '13 at 14:36
  • Usually with these issues, it's hard to know what the issue is. I can see you have an XPath query, I can see it's returning a null object when HtmlAgilityPack is running it, but how can I see if this query is right? I don't have any reference XML/HTML to go on. I don't have any C# code to show what code you are running. Your picture shows the code seems fine, so it's probably the physical XPath query. – Arran Sep 25 '13 at 14:53
  • The picture with HTML is from this page http://lekcijas.va.lv/?nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=#tabs-1 and the rest of the C# code is correct, because I have used it many times before, but this is the first time I need an XPath to an object in a table. So it's the XPath that causes the error, but it seems fine to me. Any syntax errors on my part? – Nauris Andzans Sep 25 '13 at 15:02

1 Answer

So your problem is that you are relying on positions within your XPath. Whilst this can be OK in some cases, it is not here, because you are expecting the first td in a given tr to contain a div with the class cipars.

Looking at the source in Chrome, it shows this is not always the case. You can see this by comparing the "1" element in the calendar, to "2" and "3". You'll notice the "1" element has a number of elements around it, which the others don't.

Your original XPath query does not match any element, and this is why you are getting the error. When the XPath query you give HtmlAgilityPack does not match a DOM element, SelectSingleNode returns null, so reading InnerText on the result throws.
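To make that failure mode concrete, here is a minimal sketch. It uses the .NET base library's XmlDocument as a stand-in for HtmlAgilityPack (an assumption, chosen so the example needs no extra package; its SelectSingleNode also returns null when nothing matches). The markup and the helper name TextOrDefault are made up for illustration.

```csharp
using System;
using System.Xml;

public static class NullNodeDemo
{
    // Hypothetical helper: returns the matched node's text, or a fallback
    // when the XPath query matches nothing (SelectSingleNode returns null).
    public static string TextOrDefault(XmlNode root, string xpath, string fallback)
    {
        var node = root.SelectSingleNode(xpath);
        return node == null ? fallback : node.InnerText;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        // Made-up markup: the first cell has no 'cipars' div at all.
        doc.LoadXml("<div class='kal'><table><tr>" +
                    "<td>Mon</td><td><div class='cipars'>3</div></td>" +
                    "</tr></table></div>");

        // No match in the first cell, so the fallback is printed.
        Console.WriteLine(TextOrDefault(doc, "//td[1]/div[@class='cipars']", "(no match)"));
        // The second cell does contain the div, so its text is printed.
        Console.WriteLine(TextOrDefault(doc, "//td[2]/div[@class='cipars']", "(no match)"));
    }
}
```

Checking for null before touching InnerText turns the exception into something you can log, which makes it easier to tell a wrong query from a missing table.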

Now, because you've not shown your entire code, I don't know how this code is being run. However, I am guessing you are trying to loop through all of the calendar items. Regardless, you have multiple ways of doing this, but I will show you that with the descendant XPath selector, you can just grab the whole lot in one go:

//div[@class='kal']//table//descendant::div[@class='cipars']

This will return all of the calendar items (ie 1 through 30).

However, to get all the items in a particular row, you can just stick that tr into the query:

//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']

This would return 2 to 8 (the second row of calendar items).

To target a specific one, you'll have to make an assumption about the source code of the website. It looks like every "cipars" div has an ancestor td with the class datums, so to get the "3" value from your question:

//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']
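The queries above can be tried against a small sample. The sketch below is only an illustration: the markup is a made-up, much simplified stand-in for the real calendar (a header row, a row where day 1 sits after an empty cell and inside an extra wrapper, and a row with days 2 to 4), and it uses XmlDocument from the .NET base library rather than HtmlAgilityPack so that it runs without extra packages. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class CalendarXPathDemo
{
    // Made-up, simplified calendar markup (not the real page source).
    public const string SampleMarkup =
        "<div class='kal'><table>" +
        "<tr><th>Mon</th><th>Tue</th><th>Wed</th></tr>" +
        "<tr><td/><td/><td class='datums'><a href='#'><div class='cipars'>1</div></a></td></tr>" +
        "<tr><td class='datums'><div class='cipars'>2</div></td>" +
        "<td class='datums'><div class='cipars'>3</div></td>" +
        "<td class='datums'><div class='cipars'>4</div></td></tr>" +
        "</table></div>";

    // Runs an XPath query against the sample and returns the matched texts.
    public static List<string> Select(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleMarkup);
        var texts = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            texts.Add(node.InnerText);
        }
        return texts;
    }

    public static void Main()
    {
        string[] queries =
        {
            // Positional query from the question: td[1] in that row has no div.
            "//div[@class='kal']//table//tr[2]/td[1]/div[@class='cipars']",
            // Every calendar item, wherever it is nested.
            "//div[@class='kal']//table//descendant::div[@class='cipars']",
            // Every calendar item in the third row.
            "//div[@class='kal']//table//tr[3]/descendant::div[@class='cipars']",
            // The second 'datums' cell of the third row.
            "//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']"
        };
        foreach (string query in queries)
        {
            List<string> hits = Select(query);
            Console.WriteLine(hits.Count + " match(es): " + string.Join(", ", hits));
        }
    }
}
```

In this sample the positional query matches nothing, while the descendant and class-based queries still find the cells even though day 1 is wrapped in an extra element.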

Hopefully this is enough to show the issue at least.

Edit

Although you do have an XPath problem, you also have another issue.

The site is created very strangely. The calendar is loaded in a strange way. When I hit that URL, the calendar is created by some Javascript calling an XML web service (written in PHP) that then calculates the full table to be used for the calendar.

Due to the fact this is Javascript (client side code), HtmlAgilityPack won't execute it. Therefore, HtmlAgilityPack doesn't even "see" the table. Hence the queries against it come back as "not found" (null).

There are two ways around this.

1) Use a tool that will run the scripts. By this, I mean drive a real browser. A great tool for this is Selenium. This will probably be the better overall solution, because all the scripting used by the site will actually be executed. You can still use XPath with it, so your queries will not change.

2) Send a request to the same web service that the page calls. This gets back the same HTML that the page receives, which you can then use with HtmlAgilityPack. How do we do that?

Well, you can easily POST data to a web service using C#. Just for ease of use I've stolen the code from this SO question. With this, we can send the same request the page does, and get the same HTML back.

So to send some POST data, we write a method like so:

using System.IO;
using System.Net;
using System.Text;

// Sends postData as a form-encoded POST to url and returns the response body.
public static string SendPost(string url, string postData)
{
    string webpageContent = string.Empty;

    // The request body must be sent as bytes.
    byte[] byteArray = Encoding.UTF8.GetBytes(postData);

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = "POST";
    webRequest.ContentType = "application/x-www-form-urlencoded";
    webRequest.ContentLength = byteArray.Length;

    // Write the POST data into the request stream.
    using (Stream webpageStream = webRequest.GetRequestStream())
    {
        webpageStream.Write(byteArray, 0, byteArray.Length);
    }

    // Read the whole response body as a string.
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    {
        using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
        {
            webpageContent = reader.ReadToEnd();
        }
    }

    return webpageContent;
}

We can call it like so:

string responseBody = SendPost("http://lekcijas.va.lv/lekcijas_request.php", "nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=");

How did I get this? The PHP file we are calling is the same web service the page calls, and the POST data is the same data the page sends. I found out what data it sends to the service by debugging the Javascript (using Chrome's Developer console), but you may notice it's pretty much the same thing that is in the URL's query string. That seems to be intentional.
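Since the POST data mirrors the query string of the page URL, one option is to derive it instead of hard-coding it. This is only a sketch: QueryAsPostData is a hypothetical helper, and it assumes the service keeps accepting exactly the parameters that appear in the page URL, as observed above.

```csharp
using System;

public static class PostDataDemo
{
    // Hypothetical helper: reuses the page URL's query string as the POST body.
    public static string QueryAsPostData(string pageUrl)
    {
        // Uri.Query includes the leading '?' and leaves out the '#fragment'.
        return new Uri(pageUrl).Query.TrimStart('?');
    }

    public static void Main()
    {
        string pageUrl = "http://lekcijas.va.lv/?nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=#tabs-1";
        // Prints: nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=
        Console.WriteLine(QueryAsPostData(pageUrl));
    }
}
```

The resulting string can be passed as the second argument to SendPost, so changing the month or course in the page URL changes the request to match.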

The responseBody that is returned is the physical HTML of just the table for the calendar.

What do we do with it now? We load that up into HtmlAgilityPack, because it is able to accept pure HTML.

var document = new HtmlDocument();
document.LoadHtml(responseBody);

Now, we stick that original XPath in:

var node = document.DocumentNode.SelectSingleNode("//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']");

Now, we print out what should hopefully be "3":

Console.WriteLine(node.InnerText);

My output, running it locally, is indeed: 3.

However, although this would get you over the problem you are having, I am assuming the rest of the site is built the same way. If so, you may still be able to work around it using the technique above, but tools like Selenium were created for this very reason.

Arran
  • Thanks for the answer, I'll look into it. But in this case I did notice that other elements have a different count of elements within them, so I wanted to extract the number 3 explicitly, using this code: `HtmlWeb web = new HtmlWeb(); HtmlDocument doc = web.Load("http://lekcijas.va.lv/? nodala=IT&kurss=1&gads=2013&menesis=9&c_dala=#tabs-1"); HtmlNode dateNode = doc.DocumentNode.SelectSingleNode("//div[@class='kal']//table//tbody//tr[1]/td[2]"); string date = dateNode.InnerText; date9.Text = date;` – Nauris Andzans Sep 25 '13 at 15:30
  • I tried it your way.. still no luck! Maybe I should try using full XPath like user Mpora suggested in http://stackoverflow.com/questions/14968729/html-agility-pack-loop-through-table-rows-and-columns – Nauris Andzans Sep 25 '13 at 15:46
  • To get "3" specifically, you want to use: `//div[@class='kal']//table//tr[3]//td[@class='datums'][2]/div[@class='cipars']` (P.S, remember XPath indexers are 1-based, not 0-based like C#) – Arran Sep 25 '13 at 15:52
  • Still, your XPath shows the same error. How can this trivial problem cause such a headache? – Nauris Andzans Sep 25 '13 at 16:05
  • Thumbs up @Arran. You have put serious work here. All I have to figure out now, is how to do all this in Xamarin, since I'm working on iOS app... – Nauris Andzans Sep 25 '13 at 19:07