I'm writing a C# console application with HtmlAgilityPack, and I'm trying to read a query-string parameter from a link on a webpage. The page contains a number of links, and one of them has a parameter of the form "&pagenumber=[some number]". I want to take the value after &pagenumber= and store it in an int variable.

Steps:

  1. Go to website (http://forum.tibia.com/forum/?action=board&boardid=25&threadage=-1)

  2. Look for the text "Last Page" in a url at the bottom of the page:

<a href="http://forum.tibia.com/forum/?action=board&amp;boardid=25&amp;threadage=-1&amp;pageitems=30&amp;pagenumber=974">Last Page</a>

  3. Grab the parameter value from "pagenumber" (in this case "974")

  4. Save it to an integer variable

My code so far:

string PageLink = "http://forum.tibia.com/forum/?action=board&boardid=25&threadage=-1";
Task.Run(async () =>
{
    using (var client = new HttpClient())
    {
        // Load the html of the page
        var html = await client.GetStringAsync(PageLink);
        var document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(html);

        // Find the "Last Page" link at bottom of page
        var lastPageLink = document.DocumentNode.Descendants("a").First(x => x.Attributes["href"].Value.Contains("&amp;threadage=-1&amp;pageitems=30&amp;pagenumber=")).InnerHtml;

        // Print out the pagenumber value
        Console.WriteLine(lastPageLink);
    }
}).Wait(1000);

However, my code does not print anything, and I don't get any error, so I am wondering what I am doing wrong. My approach is to find all the links (a tags), look at the "href" value of each one, and check whether it contains "&threadage=-1&pageitems=30&pagenumber=". If it does, the code should select that link's HTML.

So right now, I want my code to print: http://forum.tibia.com/forum/?action=board&boardid=25&threadage=-1&pageitems=30&pagenumber=974

I can then move forward to use Regex or something, to get the "974".

It is very important that the url contains "board&boardid=25&threadage=-1", because there are other links with the "Last Page" value in it.

Lee Cheung
  • `Task.Run(async () =>...).Wait(1000);` looks wrong. Not sure if it's the cause of your problem. I'm guessing you've done this because you need a non async hook for your console. Do it [this way instead](https://stackoverflow.com/a/9212343/542251) – Liam Jun 06 '18 at 15:29
  • Well first I need to get the values. So I wouldn't even focus on the .Wait() right now. The point is, I am not able to get any value from the link. And I am trying to figure out what I am doing wrong there. Shouldn't I look for the "a" links, look if the "href" contains that text, then return the entire href value? – Lee Cheung Jun 06 '18 at 15:32
  • If I run this code, the task throws null reference exceptions because in many cases `x.Attributes["href"]` is null. – Shelby115 Jun 06 '18 at 15:33

1 Answer

var lastPageLink = document.DocumentNode.Descendants("a").First(x => x.Attributes["href"] != null && x.Attributes["href"].Value.Contains("&amp;threadage=-1&amp;pageitems=30&amp;pagenumber=")).Attributes["href"].Value;

Two changes:

  1. I added x.Attributes["href"] != null && to the front of the lambda statement to prevent NullReferenceException when the link doesn't have an href attribute.
  2. Switched .InnerHtml to .Attributes["href"].Value to print the URL instead of Last Page.
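The two changes above can also be sketched as a small helper. This is only an illustration, not the answer's exact code: `LastPageFinder` and `FindLastPageHref` are made-up names, and it assumes the href should be HTML-decoded before matching so that both "&amp;" and "&" forms are handled. It uses HtmlAgilityPack's `GetAttributeValue` with a default value to skip links without an href, and `FirstOrDefault` so that a page with no matching link yields null instead of throwing.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

static class LastPageFinder
{
    // Hypothetical helper: returns the decoded href of the first link that
    // belongs to board 25 and carries a pagenumber parameter, or null if none.
    public static string FindLastPageHref(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("a")
            // Links without an href yield an empty string instead of a null attribute.
            .Select(a => a.GetAttributeValue("href", string.Empty))
            // Turn "&amp;" into "&" so the match does not depend on the encoding.
            .Select(href => WebUtility.HtmlDecode(href))
            .FirstOrDefault(href =>
                href.Contains("action=board&boardid=25&threadage=-1") &&
                href.Contains("pagenumber="));
    }
}
```

Calling `LastPageFinder.FindLastPageHref(html)` with the page source should then give either the full link or null, which the caller can check before parsing.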

Parsing

var matchingString = "&amp;threadage=-1&amp;pageitems=30&amp;pagenumber=";
// Position of the marker inside the href, or -1 if it is not present
var index = lastPageLink.IndexOf(matchingString);
// Everything after the marker is taken as the page number (assumes pagenumber is the last parameter)
var pageNumber = index >= 0 ? lastPageLink.Substring(index + matchingString.Length) : "Unknown";
Console.WriteLine("Page #: " + pageNumber);

Should get you what you want. I don't know regex so if you want to use that you'll have to figure that out yourself.

NOTE: I made the assumption that pagenumber would be the last url parameter which isn't always true. So if you're using this code for more than just short-term use I would adjust it accordingly.
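If the parameter order is a concern, one possible way around the assumption in the note is to look for pagenumber wherever it appears in the query string. The sketch below is not part of the original answer: `PageNumberParser` and `GetPageNumber` are hypothetical names, and it assumes the value is a plain non-negative integer. It decodes "&amp;" first, then uses a regular expression anchored on "?" or "&" so that the position of the parameter does not matter.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class PageNumberParser
{
    // Hypothetical helper: returns the pagenumber value from an href,
    // or null if the parameter is missing or is not a valid int.
    public static int? GetPageNumber(string href)
    {
        // The raw attribute value may still contain "&amp;"; decode it to "&".
        var decoded = WebUtility.HtmlDecode(href);

        // Match "pagenumber=<digits>" right after '?' or '&', anywhere in the query.
        var match = Regex.Match(decoded, @"[?&]pagenumber=(\d+)");
        if (!match.Success)
        {
            return null;
        }

        // TryParse guards against values too large for an int.
        return int.TryParse(match.Groups[1].Value, out var number) ? number : (int?)null;
    }
}
```

With the href from the question, `PageNumberParser.GetPageNumber(lastPageLink)` would give the integer directly, and a null result signals that no usable pagenumber was found.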

Shelby115
  • I'm trying to get the href for the Last Page. Not the actual text "Last Page", but the link. And more precisely, I am trying to get just the value after "&pagenumber=" in that href. But I guess the first step is getting the full link. – Lee Cheung Jun 06 '18 at 15:37
  • Edit: I see you updated your post. That works good! Thank you! Now I get the full href. Now I just need to try to find just the pagenumber value. Is there a way to do it directly in the same code? Or do I need to store the href value to a string, then use Regex to find the pagenumber=XXXXX – Lee Cheung Jun 06 '18 at 15:38
  • So basically, I am trying to get "974" as output in this case. Instead of the entire href value – Lee Cheung Jun 06 '18 at 15:41
  • @LeeCheung Yeah, if you're wanting to use regex you would just take the string version of the URL (i.e. `lastPageLink`) and run it through. – Shelby115 Jun 06 '18 at 15:48
  • I managed to get it with this line: `var lastPageValue = lastPageLink.Split('=').Last();` – Lee Cheung Jun 06 '18 at 15:48
  • Thank you for your help! You solved the big big big problem I had with the code. Rep++ !!!! – Lee Cheung Jun 06 '18 at 15:49
  • Your method of parsing is not necessarily going to always work, url parameters don't have to be in the same order every time btw. Should work in the short-term though. That said I guess mine assumes it to be last too lol. – Shelby115 Jun 06 '18 at 15:49
  • Oh yea, you're right. The order can change in the href. – Lee Cheung Jun 06 '18 at 15:52