I want to get information about a Microsoft Update in my program. However, the server returns a 404 error about 80% of the time. I boiled the problematic code down to this console application:

using System;
using System.Net;

namespace WebBug
{
    class Program
    {
        static void Main(string[] args)
        {
            while (true)
            {
                try
                {
                    WebClient client = new WebClient();
                    Console.WriteLine(client.DownloadString("https://support.microsoft.com/api/content/kb/3068708"));
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.Message);
                }
                Console.ReadKey();
            }
        }
    }
}

When I run the code, I have to get through the loop a few times until I get an actual response:

The remote server returned an error: (404) Not found.
The remote server returned an error: (404) Not found.
The remote server returned an error: (404) Not found.
<div kb-title title="Update for customer experience and diagnostic telemetry [...]

I can open and force-refresh (Ctrl + F5) the link in my browser as often as I want, and it always displays fine.

The problem occurs on two different machines with two different internet connections.
I've also tested this case using the Html Agility Pack, but with the same result.
The problem does not occur with other websites. (The root https://support.microsoft.com works fine 100% of the time.)

Why do I get this weird result?

Physikbuddha
  • problem with the net connection i guess ... – Pranay Rana Jul 10 '15 at 19:05
  • @PranayRana As I wrote, I tested this behaviour on two different machines with different ISPs. **And:** The result is not a simple connection error, I get an actual webpage from the Microsoft IIS server (which is the default 404 template). The problem does not occur with other websites. – Physikbuddha Jul 10 '15 at 19:08
  • I'm able to reproduce this as well. My best guess is something with the User Agent string (or similar), and microsoft is giving you a 404 so that bots won't connect. I'll research into this. – gunr2171 Jul 10 '15 at 19:10
  • I've tried that URL in my browser and I'm getting a 404, so it may not be a browser vs. custom application thing. – Volkan Paksoy Jul 10 '15 at 19:13
  • I can repro same behavior: console app returns 404 in 80 % of cases (I think even higher) and browser (Chrome) gives no problem. However, the 'internal browser' of VisualStudio (it opens when you hold CTRL and click the URL in the code) gives the same: 80 % of cases 404. – Sjips Jul 10 '15 at 19:26
  • I've noticed that the first time you open the site in a new browser it gives you a 404, including IE and chrome (using incognito), refreshing then makes it work. Is this cookie related? – gunr2171 Jul 10 '15 at 19:27
  • I've tried to use Wireshark to see what is going on down the line. Because it is HTTPS, it is not possible to decode the traffic. However, I see that the good result always comes from a group of IP addresses in the 95.100.162.x range, and so does the 404. It looks like the load balancer sometimes directs you to a server which has the correct file, and most of the time to a server without it. This does not explain why it almost always works in a web browser (but I see comments that this is not always the case). – Sjips Jul 10 '15 at 19:43
  • Hmm, the results are very interesting, thanks @all for your detailed research. I think I'll get around the problem by writing an async method that tries to get the content (until a specified timeout). – Physikbuddha Jul 10 '15 at 19:50
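
The retry workaround mentioned in the last comment could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the helper name `Retry.UntilTimeoutAsync` and its parameters are hypothetical, and it retries on any exception for simplicity (a real version would probably catch only `WebException`).

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library.
static class Retry
{
    // Calls fetch repeatedly until it succeeds. Once the timeout has
    // elapsed, the next failure is rethrown to the caller instead of retried.
    public static async Task<string> UntilTimeoutAsync(
        Func<Task<string>> fetch, TimeSpan timeout, TimeSpan delay)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                return await fetch();
            }
            catch (Exception) when (clock.Elapsed < timeout)
            {
                // Still within the time budget: pause briefly, then try again.
                await Task.Delay(delay);
            }
        }
    }
}
```

With the `WebClient` from the question, a call might look like `await Retry.UntilTimeoutAsync(() => client.DownloadStringTaskAsync(url), TimeSpan.FromSeconds(30), TimeSpan.FromMilliseconds(500))`. Retrying only works around the symptom; it does not explain why the 404 happens.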

1 Answer

Cookies. It's because of cookies.

As I started to dig into this problem I noticed that the first time I opened the site in a new browser I got a 404, but after refreshing (sometimes once, sometimes a few times) the site continued to work.

That's when I busted out Chrome's Incognito mode and the developer tools.

There wasn't anything too fishy with the network: there was a simple redirect to the https version if you loaded http.

But what I did notice was that the cookies changed. This is what I saw the first time I loaded the page:

[Screenshot: cookies on the first page load]

and here's the page after one or more refreshes:

[Screenshot: cookies after refreshing]

Notice how a few more cookie entries got added? The site must be trying to read those, not finding them, and "blocking" you. This might be a bot-prevention device or bad programming, I'm not sure.

Anyways, here's how to make your code work. This example uses the HttpWebRequest/Response, not WebClient.

using System.IO;
using System.Net;

string url = "https://support.microsoft.com/api/content/kb/3068708";

// This holds all the cookies we need to add.
// Notice the values match the ones in the screenshot above.
CookieContainer cookieJar = new CookieContainer();
cookieJar.Add(new Cookie("SMCsiteDir", "ltr", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("SMCsiteLang", "en-US", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("smc_f", "upr", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("smcexpsessionticket", "100", "/", ".microsoft.com"));
cookieJar.Add(new Cookie("smcexpticket", "100", "/", ".microsoft.com"));
cookieJar.Add(new Cookie("smcflighting", "wwp", "/", ".microsoft.com"));

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// Attach the cookie container.
request.CookieContainer = cookieJar;

// And now go to the internet, fetching back the contents.
// Both the response and the reader are disposed when the block ends.
string site;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    site = sr.ReadToEnd();
}

If you remove the line request.CookieContainer = cookieJar;, the request fails with a 404, which reproduces your issue.

Most of the legwork for the code example came from this post and this post.
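
If you would rather keep using WebClient as in the question, one common approach is to subclass it and attach the cookie container in GetWebRequest. The following is a minimal sketch under that assumption; the class name `CookieAwareWebClient` and its `CookieJar` property are made up for illustration and are not part of the framework.

```csharp
using System;
using System.Net;

// Hypothetical subclass: a WebClient that sends the same cookie
// container with every HTTP(S) request it creates.
class CookieAwareWebClient : WebClient
{
    public CookieContainer CookieJar { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        // Only HTTP(S) requests carry cookies; other schemes are left untouched.
        if (request is HttpWebRequest http)
        {
            http.CookieContainer = CookieJar;
        }
        return request;
    }
}
```

You would then add the same six cookies to `client.CookieJar` and call `client.DownloadString(url)` as before. I have not verified this variant against the Microsoft endpoint, so treat it as a starting point.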

gunr2171