c# parsing data from wikipedia through html-agility-pack

Question

I want to extract the release date of the film from this link.

The problem is that the date sits directly in a <td> tag that has no class or id. The only solution I can think of is to use the element's style attribute to locate the data, but I have no idea how to do that.

Here's my code

url = "https://en.wikipedia.org/wiki/" + textBox1.Text.Replace(" ", "_");
try
{
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes(/*?*/))
    {
        label1.Text+=node.InnerText;
    }                                
}
catch (Exception ex3) { }

Please help!

Why don't you just use the [API](https://en.wikipedia.org/w/api.php)? Or since you want to get info about a movie the [API of some movie db](https://developer.fandango.com/Rotten_Tomatoes)? Honestly, downloading a wiki-page and manually parsing it would be the **last** thing I'd do. — Manfred Radlwimmer, Aug 17 '17 at 13:28
@Manfred Radlwimmer It's sort of a project, and I'm only allowed to use html-agility-pack – Kabeer, Aug 17 '17 at 13:31
If by that you mean it's some sort of school assignment, then whoever is teaching you is leading you down a very wrong path. — Manfred Radlwimmer, Aug 17 '17 at 13:33
Then who's stopping you from doing this *the right way*? The html-agility-pack has its uses, and familiarity with it doesn't hurt, but it should be a last resort. When a site offers APIs, web services, RSS or anything similar, use that instead. – Manfred Radlwimmer, Aug 17 '17 at 13:38

score 0 · Answer 1 · answered Aug 17 '17 at 13:11

The following XPath expression gives you the element you need:

//*[@id="mw-content-text"]/div/table[1]/tbody/tr[14]/td

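Pro tip: Open Chrome's developer tools, select the element you are looking for, right-click it and choose "Copy > Copy XPath". Be aware that browsers insert tbody elements into tables, and these may be missing from the raw HTML that HtmlAgilityPack loads, so you may have to drop tbody from the copied path.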

Suggestion: The XPath expression is rather brittle, because it depends on the position of the table and of the row. Sometimes it makes more sense to pull one specific fragment out of the HTML with a regular expression, which can lead to a more stable solution. However, don't try to parse the HTML document as a whole with regex!

larsbe

table[1] and tr[14] select by position, so on a different wiki page this will not work. I think it is better to retrieve the whole table and look for the th element with the text 'Release date', then read the td in the same row. – Sebastian Siemens Aug 17 '17 at 13:20
True! As I said, at this point it might make sense to use RegEx or just iterate over the table rows. – larsbe Aug 17 '17 at 13:22
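
The row-by-label idea from the comments could be sketched roughly as follows. This is only a minimal sketch: the class name InfoboxReader and the method GetInfoboxValue are made up for illustration, and it assumes that the film data sits in a table whose class contains "infobox" and that the row header reads "Release date". Neither assumption is confirmed by the thread, so check the actual page markup first.

```csharp
using System;
using HtmlAgilityPack;

public static class InfoboxReader
{
    // Hypothetical helper: returns the text of the <td> that shares a row with
    // a <th> whose text contains the given label, or null if no row matches.
    public static string GetInfoboxValue(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the data lives in a table with class "infobox".
        // Using //tr avoids depending on whether a <tbody> is present.
        var rows = doc.DocumentNode.SelectNodes(
            "//table[contains(@class,'infobox')]//tr[th and td]");
        if (rows == null)
        {
            return null; // SelectNodes returns null when nothing matches
        }

        foreach (var row in rows)
        {
            var header = Clean(row.SelectSingleNode("th").InnerText);
            if (header.IndexOf(label, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return Clean(row.SelectSingleNode("td").InnerText);
            }
        }
        return null;
    }

    // Decodes HTML entities and turns non-breaking spaces into plain spaces.
    private static string Clean(string text)
    {
        return HtmlEntity.DeEntitize(text).Replace('\u00A0', ' ').Trim();
    }
}
```

With the page source downloaded into a string (for example via new HtmlWeb().Load(url).DocumentNode.OuterHtml), a call such as InfoboxReader.GetInfoboxValue(html, "Release date") would return the cell text, or null when the page has no such row. Matching on the label rather than on table[1]/tr[14] is what makes this less dependent on the layout of a particular article.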
