I'm trying to crawl a website but can't get past the login using the .NET `HttpWebRequest` and `HttpWebResponse` classes. According to Network Monitor, a key difference is that the browser's login POST carries the payload in the same message, whereas `HttpWebRequest` sends the payload in a separate message, which gets a 301 response. Is there a way to make it send a single message? Or is there something else I'm missing? I've used this code for another website, where it worked:

    // Set GET to logon site.
    SiteRequest = (HttpWebRequest)WebRequest.Create(logonUrl);

    SiteRequest.Method = "GET";
    SiteRequest.AllowAutoRedirect = AllowRedirect;
    SiteRequest.CookieContainer = SiteCookieContainer;
    SiteRequest.Referer = logonUrl;

    SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
    mainStream = SiteResponse.GetResponseStream();
    ReadAndIgnoreAllStreamBytes(mainStream);
    mainStream.Close();

    // Send POST to logon site.
    SiteRequest = (HttpWebRequest)WebRequest.Create(postUrl);
    SiteRequest.Method = "POST";
    SiteRequest.AllowAutoRedirect = AllowRedirect;
    SiteRequest.ContentType = "application/x-www-form-urlencoded";
    SiteRequest.CookieContainer = SiteCookieContainer;
    SiteRequest.CookieContainer.Add(SiteResponse.Cookies);
    SiteRequest.Referer = postUrl;
    SiteRequest.Timeout = TimeoutMsec;

    buffer = Encoding.UTF8.GetBytes(logonPostData);
    SiteRequest.ContentLength = buffer.Length;

    postStream = SiteRequest.GetRequestStream();
    postStream.Write(buffer, 0, buffer.Length);
    postStream.Flush();
    postStream.Close();

    SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();

Using the HtmlWeb class in HtmlAgilityPack has the same issue.
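
A plausible cause of the separate payload, offered as an assumption rather than a confirmed diagnosis: by default `HttpWebRequest` adds an `Expect: 100-continue` header to a POST and writes the headers before the body, so a packet capture can show the body as its own frame even though it is still a single HTTP request. The sketch below turns that behavior off for one request. `LoginRequestSketch`, `CreateLoginPost`, and `WriteBody` are hypothetical names, not part of the code above.

```csharp
using System;
using System.Net;
using System.Text;

static class LoginRequestSketch
{
    // Builds (but does not send) a form POST with Expect: 100-continue disabled.
    public static HttpWebRequest CreateLoginPost(string postUrl, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(postUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        // Without this, the client may announce the body with an Expect header
        // and send it only after the headers have gone out.
        request.ServicePoint.Expect100Continue = false;
        return request;
    }

    // Writes the form data as the request body; this opens the connection.
    public static void WriteBody(HttpWebRequest request, string formData)
    {
        byte[] body = Encoding.UTF8.GetBytes(formData);
        request.ContentLength = body.Length;
        using (var stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
    }
}
```

Setting `ServicePointManager.Expect100Continue = false` once at startup has the same effect for every request. Whether the server cares about the split at all is a separate question, since both forms are valid HTTP.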

Thanks.

Update:

It turns out I was using the "www.example.com" form of the address instead of "example.com", which caused the 301 redirect. With the correct address, however, I now get a 404 (page not found) error.
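
The www redirect can be made visible instead of being followed silently. To the best of available knowledge, when `AllowAutoRedirect` is on, the .NET Framework reissues a redirected POST as a GET without the form body, so a login POST that hits a 301 may never reach the real endpoint. The sketch below assumes the request was sent with `AllowAutoRedirect = false` and only works out where the server wants the client to go. `RedirectSketch` and `GetRedirectTarget` are hypothetical names.

```csharp
using System;
using System.Net;

static class RedirectSketch
{
    // Returns the absolute redirect target, or null when the response
    // is not a redirect or carries no Location header.
    public static Uri GetRedirectTarget(Uri requestUri, HttpStatusCode status, string location)
    {
        int code = (int)status;
        bool isRedirect = code == 301 || code == 302 || code == 303
                       || code == 307 || code == 308;
        if (!isRedirect || string.IsNullOrEmpty(location))
        {
            return null;
        }

        // Location may be absolute or relative; an absolute value replaces the base.
        return new Uri(requestUri, location);
    }
}
```

A caller would pass `response.ResponseUri`, `response.StatusCode`, and `response.Headers["Location"]`, then repeat the POST against the returned URL if it is not null.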

Here's what the browser is sending for the post:

- Http: Request, POST /accounts/signin 
    Command: POST
  + URI: /accounts/signin
    ProtocolVersion: HTTP/1.1
    Accept:  text/html, application/xhtml+xml, */*
    Referer:  http://***.com/accounts/signin
    Accept-Language:  en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
    UserAgent:  Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)
  + ContentType:  application/x-www-form-urlencoded
    Accept-Encoding:  gzip, deflate
    Host:  example.com
    ContentLength:  67
    DNT:  1
    Connection:  Keep-Alive
    Cache-Control:  no-cache
  - Cookie:  PHPSESSID=169***efe; lang=en_US; cart=eyJ***wfQ%3D%3D; cartitems=W10%3D; __utma=***; __utmb=***; __utmc=**; __utmz=**
      PHPSESSID: 169***efe
      lang: en_US
      cart: eyJ***wfQ%3D%3D
      cartitems: W10%3D
      __utma: ***
      __utmb: ***
      __utmc: ***
      __utmz: ***

    HeaderEnd: CRLF
  - payload: HttpContentType =  application/x-www-form-urlencoded
     url: 
     email: ***
     password: ***

Here's what I'm sending:

(POST:)

- Http: Request, POST /accounts/signin 
    Command: POST
  + URI: /accounts/signin
    ProtocolVersion: HTTP/1.1
  + ContentType:  application/x-www-form-urlencoded
    Accept:  text/html, application/xhtml+xml, */*
    Accept-Language:  en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
    Accept-Encoding:  gzip, deflate
    DNT:  1
    Cache-Control:  no-cache
    Referer:  http://***.com/accounts/signin
    Host:  chinesepod.com
  - Cookie:  lang=en_US; cart=eyJ***jowfQ%3D%3D; cartitems=W10%3D; PHPSESSID=944***3e7
      lang: en_US
      cart: eyJ***wfQ%3D%3D
      cartitems: W10%3D
      PHPSESSID: 944***3e7

    ContentLength:  61
    HeaderEnd: CRLF

(separate payload:)

- Http: HTTP Payload, URL: /accounts/signin 
  - payload: HttpContentType =  application/x-www-form-urlencoded
     url: 
     email: ***
     password: ***

The browser's request has these __utXX cookies, which I assume the browser adds for some kind of tagging, right? Otherwise, assuming cookie order doesn't matter, the key difference is that my payload is sent separately. Does anything else look amiss?
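
Two other differences are visible in the traces above: the browser sends `ContentLength: 67` where the code sends 61, and the code's request has no `UserAgent` header. The length gap suggests the two form bodies are not byte-identical, and one possible reason (an assumption, since the values are masked) is that the browser percent-encodes characters such as `@` in the email field. The sketch below builds the body with explicit encoding and copies two headers from the browser trace. `FormPostSketch`, `BuildFormBody`, and `CopyBrowserHeaders` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

static class FormPostSketch
{
    // Joins name/value pairs as name=value&name=value, percent-encoding both sides.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder();
        foreach (var field in fields)
        {
            if (sb.Length > 0)
            {
                sb.Append('&');
            }
            sb.Append(Uri.EscapeDataString(field.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(field.Value ?? string.Empty));
        }
        return sb.ToString();
    }

    // Copies the Accept and User-Agent values seen in the browser trace.
    public static void CopyBrowserHeaders(HttpWebRequest request)
    {
        request.Accept = "text/html, application/xhtml+xml, */*";
        request.UserAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)";
    }
}
```

Note that `Uri.EscapeDataString` encodes a space as `%20`, whereas browsers typically send `+` in form bodies; servers normally accept both. Comparing the raw body bytes in the two captures should show exactly where the six-byte difference comes from.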

Thanks.

-John

arserbin3
jtsoftware
  • Your code sends the POST message as a payload. The 301 is a permanent redirect. I suspect the problem you're having is that `AllowRedirect` is `false`. What happens if you write `SiteRequest.AllowAutoRedirect = true;`? – Jim Mischel Nov 08 '13 at 16:01
  • Setting AllowRedirect to true or false didn't make any difference. Is there an example somewhere of doing raw HTTP with .NET? I got past this by getting the PHPSESSID cookie from a login session in the browser using NetMonitor and plugging it into my later transactions, but it would be nice not to have to do that. I'm thinking that if I can duplicate what the browser sends in the login, it should work. – jtsoftware Nov 12 '13 at 14:10
  • I would suggest that you get [Fiddler](http://fiddler2.com/) and examine the traffic. That will tell you if you're communicating the cookie and other data correctly. Also see http://stackoverflow.com/q/2972643/56778 regarding cookies. – Jim Mischel Nov 12 '13 at 14:41
  • I'm using NetMonitor, which should show the same data, right? – jtsoftware Nov 16 '13 at 16:11
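
The workaround mentioned in the comments, reusing a PHPSESSID captured from a browser session, can be sketched as follows. This is a minimal illustration, not the asker's actual code: `SessionCookieSketch` and `WithSessionCookie` are hypothetical names, and the host and session id are whatever was copied from the capture.

```csharp
using System;
using System.Net;

static class SessionCookieSketch
{
    // Creates a cookie container pre-loaded with a session cookie for the given host.
    public static CookieContainer WithSessionCookie(string host, string sessionId)
    {
        var container = new CookieContainer();
        // Path "/" makes the cookie apply to every URL on the host.
        container.Add(new Cookie("PHPSESSID", sessionId, "/", host));
        return container;
    }
}
```

Assigning the returned container to `CookieContainer` on each later `HttpWebRequest` sends the session cookie along, which skips the login step for as long as the server keeps that session alive.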
