
I wrote an XML grabber to fetch and decode XML files from websites. It works fine for the most part, but it always returns this error:

"The remote server returned an error: (403) Forbidden."

for the site http://w1.weather.gov/xml/current_obs/KSRQ.xml

My code is:

XmlDocument xmldoc = new XmlDocument();
CookieContainer cookies = new CookieContainer();
HttpWebRequest webRequest = (HttpWebRequest)HttpWebRequest.Create(Path);
webRequest.Method = "GET";
webRequest.CookieContainer = cookies;
using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
{
    using (StreamReader streamReader = new StreamReader(webResponse.GetResponseStream()))
    {
        string xml = streamReader.ReadToEnd();
        xmldoc.LoadXml(xml);
    }
}

The exception is thrown in the GetResponse method. How can I find out what happened?

  • have you used the debugger..? what does `webRequest` look like in your 2nd line of code..? what does the xml look like when you are doing the `ReadToEnd();` Method..? look at the answer here with `88` upvotes http://stackoverflow.com/questions/7543324/how-to-convert-webresponse-getresponsestream-return-into-a-string – MethodMan Jul 12 '16 at 18:42
  • @MethodMan - based on the last line I do not think the code is making it that far. It fails on the first `using` block on method `GetResponse()`. – Igor Jul 12 '16 at 18:44
  • > requests xml > extension is xml > gets HTML and you'll like it > government –  Jul 12 '16 at 18:48
  • @Igor No, it is not expected. I was confused as well. But I can get the real xml file in chrome browser's debug console. – ncite Jul 12 '16 at 18:48
  • @ncite - I just noticed that too when I went to view source in my browser. I believe Mike and Stan have the correct answers here, that is what I would try first. – Igor Jul 12 '16 at 18:49
  • @Will: It is returning XML in fact. The XML contains a stylesheet directive that is being read and automagically processed by the browser. When the OP makes his request in code it will return the pure XML. It's a pretty web friendly way to do it I think :o) – Mike Goodwin Jul 12 '16 at 19:00
  • 1
    @MikeGoodwin ... ... ... .... yeah, but still. –  Jul 12 '16 at 19:13

4 Answers


It could be that your request is missing a header that is required by the server. I requested the page in a browser, recorded the exact request using Fiddler and then removed the User-Agent header and reissued the request. This resulted in a 403 response.

This is often used by servers in an attempt to prevent scripting of their sites just like you are doing ;o)

In this case, the Server header in the 403 response is "AkamaiGHost", which indicates an edge node of one of Akamai's cloud security products. Perhaps a WAF rule intended to block bots is triggering the 403.

It seems that adding any value at all to the User-Agent header is enough for this site. For example, I set it to "definitely-not-a-screen-scraper" and that seems to work fine.
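
Applied to the request in the question, that is a single extra line before the GetResponse() call (the value here is just an illustration; any non-empty string appears to do):

webRequest.UserAgent = "definitely-not-a-screen-scraper";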

In general, when you have this kind of problem it very often helps to look at the actual HTTP requests and responses using browser tools or a proxy like Fiddler. As Scott Hanselman says,

"The internet is not a black box."

http://www.hanselman.com/blog/TheInternetIsNotABlackBoxLookInside.aspx
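
As for "how can I find out what happened": GetResponse() throws a WebException, and its Response property still carries the server's reply. Here is a minimal sketch (not from the original question, and assuming the usual System, System.IO and System.Net usings) for dumping the status code, headers and body of the failed response:

try
{
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    {
        // ... normal processing of the successful response ...
    }
}
catch (WebException ex)
{
    // The failed response (if any) is attached to the exception.
    HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
    if (errorResponse != null)
    {
        Console.WriteLine((int)errorResponse.StatusCode);       // e.g. 403
        Console.WriteLine(errorResponse.Headers["Server"]);     // e.g. "AkamaiGHost"
        using (StreamReader reader = new StreamReader(errorResponse.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());               // error body, if any
        }
    }
}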

Mike Goodwin
  • 1
    I agree, this would probably be the culprit. @ncite - see this previous SO answer [How to consume WebAp2 without any authentication in C#](http://stackoverflow.com/a/38316025/1260204) I wrote yesterday that had the exact same problem and was fixed by adding the user-agent header. – Igor Jul 12 '16 at 18:48
  • Thank you! Both Mike Goodwin and @Igor. – ncite Jul 12 '16 at 19:07
  • It can be things other than `User-Agent` too, I just had an issue that turned out to be a missing `X-Requested-With` header. – DavidP Jan 04 '19 at 20:58

Clearly, the URL works from a browser. It just doesn't work from the code. It would appear that the server is accepting/rejecting requests based on the user agent, probably as a very basic way of trying to prevent crawlers.

To get through, just set the UserAgent property to something it will recognize, for instance:

webRequest.UserAgent = @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";

That does seem to work.

sstan

In my particular case, it was not the UserAgent header, but the Accept header that the server didn't like.

request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";

You can use the browser's Network tab in the dev tools to see what the correct headers should be.
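
For example, mirroring a few of the browser's request headers onto the HttpWebRequest might look like this (the header values and the MyXmlFetcher name below are illustrative, not taken from the original answer):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (compatible; MyXmlFetcher/1.0)";                      // User-Agent and Accept have dedicated properties
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.Headers["Accept-Language"] = "en-US,en;q=0.9";                                 // other, non-restricted headers go via Headers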

Stew

Is your request going through a proxy server? If yes, add the following line before your GetResponse() call.

webRequest.Proxy.Credentials = System.Net.CredentialCache.DefaultCredentials;
Shiva