I want to scrape the HTML of a website. When I open it in my browser (Chrome or Firefox), I can access the site and its HTML without any problem.

When I try to fetch and parse the HTML in C# using HttpWebRequest and HtmlAgilityPack, the website redirects me to another site, so I end up parsing the HTML of the redirect target instead.

Any idea how to solve this problem?

I suspected the site detects that my program is not a browser and redirects it right away, so I tried Selenium with ChromeDriver and FirefoxDriver, but no luck: I still get redirected immediately.

The Website: https://www.jodel.city/7700#!home

private void bt_load_Click(object sender, EventArgs e)
{
    var url = @"https://www.jodel.city/7700#!home";
    var req = (HttpWebRequest)WebRequest.Create(url);
    req.AllowAutoRedirect = false; // do not follow the redirect automatically
    // req.Referer = "http://www.muenchen.de/";

    // Dispose the response and the reader once the body has been read.
    using (var resp = req.GetResponse())
    using (var sr = new StreamReader(resp.GetResponseStream()))
    {
        string returnedContent = sr.ReadToEnd();
        Console.WriteLine(returnedContent);
    }
}

1 Answer

And of course, cookies are to blame again, because cookies are great and amazing.

So, let's look at what happens in Chrome the first time you visit the site:

(I went to https://www.jodel.city/7700#!home):

Yes, I got a 302 redirect, but I also got told by the server to set a __cfduid cookie (twice actually).

When you visit the site again, you are correctly let into the site:

Notice how this time a __cfduid cookie was sent along? That's the key here.

Your C# code needs to:

  1. Go to the site once, get redirected, but obtain the cookie value from the response header.
  2. Go BACK to the site with the correct cookie value in the request header.

You can go to the first link in this post to see an example of how to set cookie values for requests.
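
As a rough sketch of the two steps above, a shared CookieContainer can carry the cookie from the first response into the second request. The class and method names (CookieFetch, FetchWithCookies, CookieHeaderFor) are made up for illustration. The sketch also assumes the server accepts the cookie when it is replayed like this, which the answer only observed in Chrome.

```csharp
using System;
using System.IO;
using System.Net;

public static class CookieFetch
{
    // Hypothetical helper: requests the URL twice, sharing one cookie jar.
    public static string FetchWithCookies(string url)
    {
        var cookies = new CookieContainer();

        // Step 1: first visit. The redirect response is expected to carry
        // Set-Cookie headers, which the container stores for us.
        var first = (HttpWebRequest)WebRequest.Create(url);
        first.CookieContainer = cookies;
        first.AllowAutoRedirect = false;
        using (first.GetResponse()) { }

        // Step 2: second visit. The same container sends the cookie back.
        var second = (HttpWebRequest)WebRequest.Create(url);
        second.CookieContainer = cookies;
        second.AllowAutoRedirect = false;
        using (var resp = second.GetResponse())
        using (var reader = new StreamReader(resp.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Debugging aid: the Cookie header value the container would send to uri.
    public static string CookieHeaderFor(CookieContainer cookies, Uri uri)
    {
        return cookies.GetCookieHeader(uri);
    }
}
```

Calling CookieFetch.FetchWithCookies("https://www.jodel.city/7700") should then return the page HTML if the cookie is really all the server checks. If the second response is still a redirect, inspecting CookieHeaderFor on the container shows whether a cookie was captured at all.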

  • Nice debugging-fu, gunr2171. – Sam Axe Oct 09 '17 at 20:03
  • Now, for the record, I don't know _why_ the server is doing this. As in that related post, you shouldn't be requiring your client to have a cookie value _before_ they reach the site. Hopefully this is just bad programming on the server's part. – gunr2171 Oct 09 '17 at 20:03
  • You are awesome. I just tested it with my own cookie values to see if it works. It works! I just need to get the cookies dynamically, but I can do that on my own. Thanks. – Zesa Rex Oct 09 '17 at 20:09