I'm trying to crawl a website but can't get past the login using the .NET HttpWebRequest and HttpWebResponse classes. According to Network Monitor, a key difference seems to be that the login from a browser includes the payload in the POST message itself, whereas HttpWebRequest sends the payload in a separate message, which gets a 301 response. Is there a way to make it send a single message? Or is there something else I'm missing? I've used this code for another website, where it worked:
// Set GET to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(logonUrl);
SiteRequest.Method = "GET";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.Referer = logonUrl;
SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
mainStream = SiteResponse.GetResponseStream();
ReadAndIgnoreAllStreamBytes(mainStream);
mainStream.Close();
// Send POST to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(postUrl);
SiteRequest.Method = "POST";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.ContentType = "application/x-www-form-urlencoded";
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.CookieContainer.Add(SiteResponse.Cookies);
SiteRequest.Referer = postUrl;
SiteRequest.Timeout = TimeoutMsec;
buffer = Encoding.UTF8.GetBytes(logonPostData);
SiteRequest.ContentLength = buffer.Length;
postStream = SiteRequest.GetRequestStream();
postStream.Write(buffer, 0, buffer.Length);
postStream.Flush();
postStream.Close();
SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
Using the HtmlWeb class in HtmlAgilityPack has the same issue.
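On the single-message question, one setting that may be related is the `Expect: 100-continue` handshake. This is an assumption, not something the captures in this post confirm. On .NET Framework, HttpWebRequest enables it by default for POSTs, so the headers go out first and the body follows. The sketch below builds the POST with that handshake turned off through `ServicePoint.Expect100Continue`. The class and method names are made up for illustration, and the request is only constructed, not sent.

```csharp
using System;
using System.Net;

public static class LogonPostSketch
{
    // Hypothetical helper: builds (but does not send) the logon POST
    // with the "Expect: 100-continue" handshake disabled.
    public static HttpWebRequest BuildPost(string postUrl, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(postUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        // Assumption: the split headers/body seen in the capture comes from
        // the 100-continue handshake. This turns it off for this endpoint.
        request.ServicePoint.Expect100Continue = false;
        return request;
    }
}
```

The same switch exists process-wide as `ServicePointManager.Expect100Continue`. If the payload still shows up as a separate frame afterwards, it may simply be how the headers and body are split across TCP segments, which a server should not care about.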
Thanks.
Update:
It turns out I was using the "www.example.com" form of the address rather than "example.com", hence the redirect. But with the correct address I now get a 404 (page not found) error.
Here's what the browser is sending for the post:
- Http: Request, POST /accounts/signin
Command: POST
+ URI: /accounts/signin
ProtocolVersion: HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: http://***.com/accounts/signin
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
UserAgent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)
+ ContentType: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
Host: example.com
ContentLength: 67
DNT: 1
Connection: Keep-Alive
Cache-Control: no-cache
- Cookie: PHPSESSID=169***efe; lang=en_US; cart=eyJ***wfQ%3D%3D; cartitems=W10%3D; __utma=***; __utmb=***; __utmc=**; __utmz=**
PHPSESSID: 169***efe
lang: en_US
cart: eyJ***wfQ%3D%3D
cartitems: W10%3D
__utma: ***
__utmb: ***
__utmc: ***
__utmz: ***
HeaderEnd: CRLF
- payload: HttpContentType = application/x-www-form-urlencoded
url:
email: ***
password: ***
Here's what I'm sending:
(POST:)
- Http: Request, POST /accounts/signin
Command: POST
+ URI: /accounts/signin
ProtocolVersion: HTTP/1.1
+ ContentType: application/x-www-form-urlencoded
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
Accept-Encoding: gzip, deflate
DNT: 1
Cache-Control: no-cache
Referer: http://***.com/accounts/signin
Host: chinesepod.com
- Cookie: lang=en_US; cart=eyJ***jowfQ%3D%3D; cartitems=W10%3D; PHPSESSID=944***3e7
lang: en_US
cart: eyJ***wfQ%3D%3D
cartitems: W10%3D
PHPSESSID: 944***3e7
ContentLength: 61
HeaderEnd: CRLF
(separate payload:)
- Http: HTTP Payload, URL: /accounts/signin
- payload: HttpContentType = application/x-www-form-urlencoded
url:
email: ***
password: ***
The browser's request has those __utm* cookies, which I'm assuming are added on the browser side for some kind of tagging, right? Otherwise, assuming cookie ordering doesn't matter, the key difference is that my payload is sent separately. Does anything else look amiss?
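Reading the two captures side by side, two other differences show up: the browser sends a UserAgent header and a ContentLength of 67, while the other request has no UserAgent and a ContentLength of 61. Whether either one matters to this server is unknown. The sketch below is one way to rule them out: it form-encodes the three captured fields (`url`, `email`, `password`) and copies the browser's User-Agent string. The helper names are hypothetical, and the guess that the six-byte gap comes from unencoded characters such as `@` is only an assumption.

```csharp
using System;
using System.Net;
using System.Text;

public static class SigninRequestSketch
{
    // Hypothetical helper: URL-encodes each field value the way a browser
    // form submission would (for example '@' becomes "%40", space becomes '+').
    public static string BuildFormBody(string url, string email, string password)
    {
        return "url=" + WebUtility.UrlEncode(url)
             + "&email=" + WebUtility.UrlEncode(email)
             + "&password=" + WebUtility.UrlEncode(password);
    }

    // Byte count to compare against the ContentLength in the browser capture.
    public static int BodyLength(string body)
    {
        return Encoding.UTF8.GetByteCount(body);
    }

    // Hypothetical helper: sets the User-Agent value taken from the browser capture.
    public static void ApplyBrowserUserAgent(HttpWebRequest request)
    {
        request.UserAgent =
            "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)";
    }
}
```

Comparing `BodyLength(BuildFormBody(...))` for the real credentials against the browser's ContentLength would show whether the two payloads actually match byte for byte.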
Thanks.
-John