
Using: Delphi 2010, latest version of Indy

I am trying to scrape data from Google's AdSense web page in order to retrieve the reports, but so far without success: my application stops after the first request and does not proceed.

Using Fiddler to inspect the traffic while a web browser loads the AdSense page, I can see that the browser's request goes through a number of redirects before the page is loaded.

However, my Delphi application is only generating a couple of requests before it stops.

Here are the steps I have followed:

  1. Drop a TIdHTTP and a TIdSSLIOHandlerSocketOpenSSL component on the form (IdHTTP1 and IdSSLIOHandlerSocketOpenSSL1).
  2. Set the IdHTTP1 properties AllowCookies and HandleRedirects to True, and its IOHandler property to IdSSLIOHandlerSocketOpenSSL1.
  3. Set the IdSSLIOHandlerSocketOpenSSL1 property SSLOptions.Method := sslvSSLv23.
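The same setup can also be done at run time instead of in the form designer. The following is only a minimal sketch, assuming Indy 10 (units IdHTTP and IdSSLOpenSSL); CreateConfiguredHTTP is a hypothetical helper name, not part of Indy. The OpenSSL DLLs are only needed once an https request is actually made.

```delphi
program ConfigureIdHTTP;

{$APPTYPE CONSOLE}

uses
  IdHTTP, IdSSLOpenSSL;

// Hypothetical helper (not part of Indy): creates a TIdHTTP set up as in steps 1-3.
function CreateConfiguredHTTP: TIdHTTP;
var
  SSL: TIdSSLIOHandlerSocketOpenSSL;
begin
  Result := TIdHTTP.Create(nil);
  try
    // Owned by Result, so it is freed together with the TIdHTTP instance.
    SSL := TIdSSLIOHandlerSocketOpenSSL.Create(Result);
    SSL.SSLOptions.Method := sslvSSLv23;  // step 3
    Result.IOHandler := SSL;              // step 2
    Result.AllowCookies := True;          // step 2
    Result.HandleRedirects := True;       // step 2
  except
    Result.Free;
    raise;
  end;
end;

var
  HTTP: TIdHTTP;
begin
  HTTP := CreateConfiguredHTTP;
  try
    // No request is made here; this only confirms the property was set.
    Writeln('HandleRedirects: ', HTTP.HandleRedirects);
  finally
    HTTP.Free;
  end;
end.
```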

Finally I have this code:

procedure TfmMain.GetUrlToFile(AURL, AFile : String);
var
 Output : TMemoryStream;
begin
  Output := TMemoryStream.Create;
  try
    IdHTTP1.Get(AURL, Output);  // request the URL passed in as AURL
    Output.SaveToFile(AFile);
  finally
    Output.Free;
  end;
end;

However, it does not reach the login page as expected. I would expect it to behave like a web browser and follow the redirects until it reaches the final page.

This is the output of the headers from Fiddler:

HTTP/1.1 302 Found
Location: https://encrypted.google.com/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Set-Cookie: PREF=ID=5166063f01b64b03:FF=0:TM=1293571783:LM=1293571783:S=a5OtsOqxu_GiV3d6; expires=Thu, 27-Dec-2012 21:29:43 GMT; path=/; domain=.google.com
Set-Cookie: NID=42=XFUwZdkyF0TJKmoJjqoGgYNtGyOz-Irvz7ivao2z0--pCBKPpAvCGUeaa5GXLneP41wlpse-yU5UuC57pBfMkv434t7XB1H68ET0ZgVDNEPNmIVEQRVj7AA1Lnvv2Aez; expires=Wed, 29-Jun-2011 21:29:43 GMT; path=/; domain=.google.com; HttpOnly
Date: Tue, 28 Dec 2010 21:29:43 GMT
Server: gws
Content-Length: 226
X-XSS-Protection: 1; mode=block

Firstly, is there anything wrong with this output?

Is there something more that I should do to get the IdHTTP component to keep pursuing the redirects until the final page?

SteveL
  • Why aren't you using [the API](http://code.google.com/apis/adsense/developer/ReportService.html)? – Rob Kennedy Dec 29 '10 at 00:56
  • Applying for access to the Google AdSense API requires that your website gets 100,000 or more page views per day. This website does not qualify, so I have to scrape the page manually. – SteveL Dec 29 '10 at 12:15
  • Maybe there is JavaScript involved (you can disable it in the browser to verify; if the login still works, I was wrong). – mjn Dec 31 '10 at 17:37
  • @mjn: Disabled JavaScript, cleared browser cache and reloaded the page - the login page loads fine. – SteveL Jan 02 '11 at 10:46

3 Answers


IdHTTP component property values prior to making the call:

    Name := 'IdHTTP1';
    IOHandler := IdSSLIOHandlerSocketOpenSSL1;
    AllowCookies := True;
    HandleRedirects := True;
    RedirectMaximum := 35;
    Request.UserAgent := 
      'Mozilla/5.0 (Windows NT 5.1; rv:2.0b8) Gecko/20100101 Firefox/4.' +
      '0b8';
    HTTPOptions := [hoForceEncodeParams];
    OnRedirect := IdHTTP1Redirect;
    CookieManager := IdCookieManager1;

Redirect event handler:

procedure TfmMain.IdHTTP1Redirect(Sender: TObject; var dest: string; var
    NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
begin
   Handled := True;
end;

Making the call:

  FURL := 'https://www.google.com';

  GetUrlToFile(FURL + '/adsense/', 'a.html');

  procedure TfmMain.GetUrlToFile(AURL, AFile : String);
  var
   Output : TMemoryStream;
  begin
    Output := TMemoryStream.Create;
    try
      try
       IdHTTP1.Get(AURL, Output);
       IdHTTP1.Disconnect;
      except
        // Note: an empty except block hides any error raised by Get.
      end;
      Output.SaveToFile(AFile);
    finally
      Output.Free;
    end;
  end;

Here is the output from Fiddler (request and response headers):

[Image: Fiddler screenshot of the request and response headers]

SteveL
  • This belongs in the original question (you can edit your question at any time). Is TIdHTTP handling any redirects? Is this the first redirect you *ever* see, or simply the last one? Put a breakpoint in IdHTTP1Redirect: does execution ever stop there? – Cosmin Prund Jan 01 '11 at 18:52

Getting redirects going

Set TIdHTTP.HandleRedirects := True so that it handles redirects automatically.

TIdHTTP.RedirectMaximum sets how many successive redirects are followed.


Alternatively you may assign TIdHTTP.OnRedirect and set Handled := True from that handler. This is what I'm doing in a project that needs to read data from a WikiMedia web site (my own site).
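To see where the redirect chain actually stops, the OnRedirect handler can also log every hop. This is only a sketch under a few assumptions: Indy 10 with the handler signature shown elsewhere on this page (VMethod declared as string), the OpenSSL DLLs available at run time, and TRedirectLogger as a hypothetical helper class that is not part of Indy. The limit of 35 is an arbitrary example value.

```delphi
program LogRedirects;

{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP, IdSSLOpenSSL;

type
  // Hypothetical helper class: an event handler has to be a method of an object.
  TRedirectLogger = class
  public
    procedure HandleRedirect(Sender: TObject; var dest: string;
      var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
  end;

procedure TRedirectLogger.HandleRedirect(Sender: TObject; var dest: string;
  var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
begin
  // Print each hop so the point where the chain stops becomes visible.
  Writeln(NumRedirect, ': ', VMethod, ' ', dest);
  Handled := True; // tell TIdHTTP to follow this redirect
end;

var
  HTTP: TIdHTTP;
  Logger: TRedirectLogger;
begin
  Logger := TRedirectLogger.Create;
  HTTP := TIdHTTP.Create(nil);
  try
    HTTP.IOHandler := TIdSSLIOHandlerSocketOpenSSL.Create(HTTP);
    HTTP.HandleRedirects := True;
    HTTP.RedirectMaximum := 35; // arbitrary example limit
    HTTP.OnRedirect := Logger.HandleRedirect;
    try
      HTTP.Get('https://www.google.com/adsense/');
      Writeln('Final status: ', HTTP.ResponseCode);
    except
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    HTTP.Free;
    Logger.Free;
  end;
end.
```

If the last logged destination is not the page you expect, the reply received for that URL is the one worth inspecting.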

About the HTTP response

There is nothing wrong with that response; it is a very basic redirect to https://encrypted.google.com/ that also sets some cookies. TIdHTTP should go to the given page in response.

Other suggestions

Don't forget to assign a CookieManager, and make sure you use the same CookieManager for all subsequent requests. If you don't, you'll probably get redirected to the login page over and over again.
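As a rough illustration of the cookie advice, the sketch below keeps one TIdCookieManager alive across two requests. It assumes Indy 10 (unit IdCookieManager) and the OpenSSL DLLs at run time; the two URLs are only examples, and the cookie counts printed depend entirely on what the server sends.

```delphi
program SharedCookies;

{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP, IdSSLOpenSSL, IdCookieManager;

var
  Cookies: TIdCookieManager;
  HTTP: TIdHTTP;
begin
  Cookies := TIdCookieManager.Create(nil);
  HTTP := TIdHTTP.Create(nil);
  try
    HTTP.IOHandler := TIdSSLIOHandlerSocketOpenSSL.Create(HTTP);
    HTTP.AllowCookies := True;
    HTTP.HandleRedirects := True;
    // Every request made through HTTP now reads from and writes to this store.
    HTTP.CookieManager := Cookies;
    try
      HTTP.Get('https://www.google.com/');
      Writeln('Cookies after first request: ', Cookies.CookieCollection.Count);
      HTTP.Get('https://www.google.com/adsense/');
      Writeln('Cookies after second request: ', Cookies.CookieCollection.Count);
    except
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    // The cookie manager was assigned explicitly, so it is freed separately.
    HTTP.Free;
    Cookies.Free;
  end;
end.
```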

Cosmin Prund
  • Regarding "About the HTTP response" (nothing wrong with the response; it is a basic redirect to https://encrypted.google.com/ that TIdHTTP should follow): can you tell me how IdHTTP should go to that given page? – SteveL Dec 29 '10 at 17:54
  • TIdHTTP handles 302 status replies automatically and loads the new page if you set HandleRedirects := True. – Cosmin Prund Dec 30 '10 at 07:06
  • I have done all that you say above. However, still no joy. It stops at a redirection and does not proceed (as is viewable from the request headers in Fiddler). -- What (code or property setting) is required to make IdHTTP keep following through until the last page? ie. what do I need to do to get IdHttp to act like a web-browser? (in my case I need to get to the Adsense login page; if already logged-in to Adsense then it is the Adsense home page that I need to download) – SteveL Dec 30 '10 at 08:11
  • Please show the final HTTP reply that TIdHTTP receives before it stops following redirects. Either Google actually stops sending redirects, or it is sending a reply that TIdHTTP cannot handle. – Remy Lebeau Jan 01 '11 at 05:38
  • @Remy: Code and response/request headers posted as an answer. – SteveL Jan 01 '11 at 10:14

In my case I needed to fix dest, because it somehow contained a ';':

procedure Tfrm1.IdHTTP1Redirect(Sender: TObject; var dest: string;
  var NumRedirect: Integer; var Handled: Boolean; var VMethod: string);
var
  i: Integer;
begin

  i := Pos(';', dest);
  if i > 0 then
  begin
    dest := Copy(dest,1, i - 1);
  end;

  Handled := True;
end;
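The trimming step can be tried on its own, without Indy. This is just a sketch: StripAfterSemicolon is a hypothetical helper name and the example.com URLs are made-up inputs, not values from the actual redirect.

```delphi
program TrimDest;

{$APPTYPE CONSOLE}

// Hypothetical helper: returns Dest cut off before the first ';',
// or Dest unchanged when it contains no ';'.
function StripAfterSemicolon(const Dest: string): string;
var
  I: Integer;
begin
  I := Pos(';', Dest);
  if I > 0 then
    Result := Copy(Dest, 1, I - 1)
  else
    Result := Dest;
end;

begin
  // Both lines print https://example.com/login
  Writeln(StripAfterSemicolon('https://example.com/login;jsessionid=abc'));
  Writeln(StripAfterSemicolon('https://example.com/login'));
end.
```

Inside the OnRedirect handler this would amount to dest := StripAfterSemicolon(dest) before setting Handled := True.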
Dejan Dozet