Grabbing HTML from URL doesn't work - any tips?

Question

I have tried several methods in C# using WebClient and WebResponse, and they all return

<html><head><meta http-equiv="REFRESH" content="0; URL=http://www.windowsphone.com/en-US/games?list=xbox"><script type="text/javascript">function OnBack(){}</script></head></html>

instead of the actual rendered page when you use a browser to go to http://www.windowsphone.com/en-US/games?list=xbox

How would you go about grabbing the HTML from that location? http://www.windowsphone.com/en-US/games?list=xbox

Thanks!

/edit: examples added:

Tried:

        string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
        string resultHTML = String.Empty;
        Uri inputUri = new Uri(inputUrl);
        WebRequest request = WebRequest.CreateDefault(inputUri);
        request.Method = "GET";

        WebResponse response;
        try
        {
            response = request.GetResponse();
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                resultHTML = reader.ReadToEnd();
            } 
        }
        catch { }

Tried:

        string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
        string resultHTML = String.Empty;
        WebClient webClient = new WebClient();

        try
        {
            resultHTML = webClient.DownloadString(inputUrl);
        }
        catch { }

Tried:

        string inputUrl = "http://www.windowsphone.com/en-US/games?list=xbox";
        string resultHTML = String.Empty;
        WebResponse objResponse;
        WebRequest objRequest = HttpWebRequest.Create(inputUrl);

        try
        {
            objResponse = objRequest.GetResponse();
            using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
            {
                resultHTML = sr.ReadToEnd();
            }
        }
        catch { }

You ARE getting the HTML. HTML is the markup that a web server responds with. Are you looking to get a screen capture? Are you looking to embed a web browser in a different application? – Nick Feb 03 '12 at 21:04
Nick, I want the HTML. The methods mentioned just don't return the same HTML that my WebBrowser gets. – menschab Feb 03 '12 at 21:05
Try adding a proper User-Agent to the request. Sometimes these sites don't allow access if the requests don't appear to come from a legitimate web browser. – drew010 Feb 03 '12 at 21:07
Hi, they use meta tags to redirect the user to a page. What you get is a proper response from the server. As drew010 said, they might be trying to prevent screen scrapers from accessing the website. – Sebastian Siek Feb 03 '12 at 21:11
Also, if the way you're retrieving it can't execute JavaScript, then you're still going to be out of luck. Looks like this could be an issue. – AHungerArtist Feb 03 '12 at 21:11

score 2 · Accepted Answer · answered Feb 03 '12 at 21:19

I checked this URL, and you need to handle cookies.

When you try to access the page for the first time, you are redirected to an https URL on login.live.com and then redirected back to the original URL. The https page sets a cookie called MSPRequ for the domain login.live.com. If you do not have this cookie, you cannot access the site.

When I disabled cookies in my browser, it ended up redirecting endlessly back to the URL https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&checkda=1&ct=1328303901&rver=6.1.6195.0&wp=MBI&wreply=http:%2F%2Fwww.windowsphone.com%2Fen-US%2Fgames%3Flist%3Dxbox&lc=1033&id=268289. It has been going on for several minutes now and doesn't look like it will ever stop.

So you will have to grab the cookie from the https page when it is set, and persist that cookie for your subsequent requests.
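As a rough sketch of what persisting the cookie could look like: the code below shares one `CookieContainer` across requests so that cookies set along the redirect chain are sent back on later requests. It also follows the `<meta http-equiv="REFRESH">` stub from the question by hand, since that is an HTML-level redirect that `HttpWebRequest` does not follow on its own. The class name `CookieAwareFetcher`, the `maxHops` limit, and the regular expression are my own assumptions for illustration, and I have not confirmed that this is enough to get past the login.live.com round trip.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class CookieAwareFetcher
{
    // Matches <meta http-equiv="refresh" content="N; URL=..."> and captures the target URL.
    private static readonly Regex MetaRefresh = new Regex(
        @"<meta[^>]*http-equiv\s*=\s*[""']?refresh[""']?[^>]*content\s*=\s*[""']\s*\d+\s*;\s*url\s*=\s*([^""'>]+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the meta-refresh target found in the HTML, or null if there is none.
    public static string FindMetaRefreshUrl(string html)
    {
        Match m = MetaRefresh.Match(html);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    // Downloads one URL, storing and re-sending cookies through the shared container.
    private static string Download(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.CookieContainer = cookies;   // with no container, cookies are not stored or re-sent
        request.AllowAutoRedirect = true;    // HTTP 3xx redirects are followed automatically

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Fetches a page, following up to maxHops meta-refresh stubs with the same cookies.
    public static string Fetch(string url, int maxHops)
    {
        var cookies = new CookieContainer();
        string html = Download(url, cookies);

        for (int hop = 0; hop < maxHops; hop++)
        {
            string next = FindMetaRefreshUrl(html);
            if (next == null) break;                       // a real page, not a refresh stub
            url = new Uri(new Uri(url), next).AbsoluteUri; // resolves relative targets too
            html = Download(url, cookies);
        }
        return html;
    }
}
```

Usage would be something like `string html = CookieAwareFetcher.Fetch("http://www.windowsphone.com/en-US/games?list=xbox", 5);`. The hop limit is there so the endless loop described above cannot hang the program. If the site depends on JavaScript to finish the redirect, this will still not be enough.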

answered Feb 03 '12 at 21:19 by drew010

Thanks Drew, this looks to be exactly right. This is getting way beyond me though :( – menschab Feb 03 '12 at 21:31
I'll try to grab it using a browser object. I didn't want to do this, because I don't need all the fancy graphics, just the plain HTML, but it'll get the job done. – menschab Feb 03 '12 at 21:34
I found this [answer](http://stackoverflow.com/questions/2825377/how-can-i-get-the-webclient-to-use-cookies) that shows how to extend WebClient so it can persist cookies for you. That may help. – drew010 Feb 03 '12 at 21:45
Thanks again. I tried using that class, but got the same result. – menschab Feb 03 '12 at 22:51

score 1 · Answer 2 · answered Feb 03 '12 at 21:10

This might be because the server you are requesting HTML from returns different HTML depending on the User-Agent string. You might try something like this:

webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

That particular header may not work, but you could try others that would mimic standard browsers.
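If you are using `HttpWebRequest` rather than `WebClient`, a minimal sketch of the same idea follows. On the .NET Framework, User-Agent is a restricted header for `HttpWebRequest`, so it is set through the `UserAgent` property instead of `Headers.Add`. The class name and the user-agent string are placeholders I picked for illustration. Any string copied from a real browser should do, and there is no guarantee this particular site accepts it.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration only.
public static class UserAgentExample
{
    // Example browser-like string (an assumption, swap in your own browser's value).
    public const string BrowserUserAgent =
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0";

    // Builds a GET request that identifies itself like a desktop browser.
    public static HttpWebRequest BuildRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = BrowserUserAgent;                // sent as the User-Agent header
        request.Accept = "text/html,application/xhtml+xml";  // sent as the Accept header
        return request;
    }
}
```

The request is only built here, not sent. Call `GetResponse()` on the result and read the stream as in the question's own examples.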

answered Feb 03 '12 at 21:10 by Nick

Thanks Nick, I will try adding the headers to the webclient. If this one doesn't work, I'll try some other stuff as well using header info. – menschab Feb 03 '12 at 21:20
