Fetch the content of a web page with DELPHI

Question

I am trying to retrieve the <table><tbody> section of this page:

http://www.mfinante.ro/infocodfiscal.html?captcha=null&cod=18505138

I am using Delphi XE7.

I tried using IXMLHttpRequest, WinInet (InternetOpenURL(), InternetReadFile()), TRestClient/TRestRequest/TRestResponse, TIdHTTP.Get(), but all they retrieve is some gibberish, like this:

<html><head><meta http-equiv="Pragma" content="no-cache"/>'#$D#$A'<meta http-equiv="Expires" content="-1"/>'#$D#$A'<meta http-equiv="CacheControl" content="no-cache"/>'#$D#$A'<script>'#$D#$A'(function(){p={g:"0119a4477bb90c7a81666ed6496cf13b5aad18374e35ca73f205151217be1217a93610c5877ece5575231e088ff52583c46a8e8807483e7185307ed65e",v:"87696d3d40d846a7c63fa2d10957202e",u:"1",e:"1",d:"1",a:"challenge etc.

Look at this code for example:

program htttpget;

{$APPTYPE CONSOLE}
{$R *.res}

uses
  SysUtils, HTTPApp, IdHTTP, ActiveX;

var
  CoResult: Integer;
  HTTP: TIdHTTP;
  Query: String;
  Buffer: String;
begin
  try
    CoResult := CoInitializeEx(nil, COINIT_MULTITHREADED);
    if not((CoResult = S_OK) or (CoResult = S_FALSE)) then
    begin
      Writeln('Failed to initialize COM library.');
      Exit;
    end;
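    HTTP := TIdHTTP.Create;
    try
      Query := 'http://www.mfinante.ro/infocodfiscal.html?captcha=null' +
               '&cod=18505138';
      Buffer := HTTP.Get(Query);
      Writeln(Buffer);
    finally
      HTTP.Free;  { release the client even if Get raises }
    end;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);  { report errors instead of swallowing them }
  end;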
end.

What is wrong with this page? I have not written many "get" requests in my life, but other websites return normal responses. Can someone at least clarify why this isn't working?

Are there other ways to get the content of this web page? Are there other programming languages (Java, scripting, etc.) that can do this without third-party software (such as using the Firefox source code to emulate a browser, fetch the page without showing a window, and then copy the content)?

What you call gibberish looks like a normal response with valid HTML and JavaScript. What makes you think something is wrong with it? – Ondrej Kelle Oct 06 '16 at 13:03
Yes, it's JavaScript, not gibberish, but if you open that link in a browser and use View Source, you see entirely different code. – nostriel Oct 06 '16 at 13:07
What you see after loading the page in the browser might be the result of running the initial script, which can modify the page content after it's loaded. – Ondrej Kelle Oct 06 '16 at 13:10
So what I retrieve is a lot of functions, with code like this: {var table = "00000000 77073096 EE0E612C 990951BA 076DC419 706AF48F E963A535 9E6495A3 0EDB8832 79DCB8A4 E0D5E91E 97D2D988 but there is no table tag in it. If it's a script that modifies the page, and it may well be, is there a way to get the page content as seen in the browser? – nostriel Oct 06 '16 at 13:13
You can view the same original response using e.g. Developer Tools (Network) in Firefox or Chrome. The browser runs the script, which then modifies the page that the browser shows. To achieve the same from your program, the easiest way is probably to embed (and automate) a browser. – Ondrej Kelle Oct 06 '16 at 13:20
Thank you. David Heffernan said the same thing. There is other code in that page, so I need something to run it. And that is definitely not the Delphi compiler. – nostriel Oct 06 '16 at 13:31

David Heffernan · Answer 1 · 2016-10-06T13:19:57.617

3

This is normal: you have indeed retrieved the content correctly. What happens in your browser is that the script is executed and the page gets built client-side. If you wish to replicate that in your code, you will need to do the same: execute the script exactly as the browser would.

What you are really looking for here is what is known as a headless browser. Integrate one of those into your program. Then get the headless browser to process the request, including executing scripts. When it has finished executing the scripts, read the modified content of the page.

edited Oct 06 '16 at 13:19

answered Oct 06 '16 at 13:15


So, as I expected, I need an interpreter, a headless browser (from the Firefox source code, for example) that knows how to run that script. Are there other languages that can do this natively, without a browser? – nostriel Oct 06 '16 at 13:19

I wouldn't do this in Delphi if I were you. https://github.com/dhamaniasad/HeadlessBrowsers – David Heffernan Oct 06 '16 at 13:20

Yes. Thank you. Unfortunately, Delphi is the main development language for my app, and I hit lots of walls with it. Fortunately, I know other languages, so I could write small apps in other languages and then run them from my Delphi app. Among these Java, JavaScript and C++ headless browsers, do you have a favorite? – nostriel Oct 06 '16 at 13:28

Don't really have any recommendations on this. I think you will need to research this yourself and work out a way forward that best suits your particular needs. – David Heffernan Oct 06 '16 at 13:39

I have used PhantomJS (headless webkit, sort of like chrome) to do stuff like this, but not with Delphi. I'd be doing stuff like this using Karma. http://www.methodsandtools.com/tools/karma.php – Warren P Oct 06 '16 at 15:45

score 2 · Accepted Answer · edited May 23 '17 at 11:48

You can use TWebBrowser for this.

See this post: How can I get HTML source code from TWebBrowser

The answer by RRUZ, which you can find in many places on the internet, is not what you are looking for. It gives you the original HTML source, as would IdHttp.Get().

However, the answer by Mehmet Fide will give you the HTML source of the DOM, which is what you are looking for.

I offer a variation here. (It includes some hacks that were required at the time to get full DOCTYPE. Not sure if they are still needed...)

function EndStr(const S: String; const Count: Integer): String;
var
  I: Integer;
  Index: Integer;
begin
  Result := '';
  for I := 1 to Count do
  begin
    Index := Length(S)-I+1;
    if Index > 0 then
      Result := S[Index] + Result;
  end;
end;

function GetHTMLDocumentSource(WebBrowser: TWebBrowser; var Charset: String):
    String;
var
  Element: IHTMLElement;
  Node: IHTMLDomNode;
  Document: IHTMLDocument2;
  I: Integer;
  S: String;
begin
  Result := '';
  Document := WebBrowser.Document as IHTMLDocument2;

  For I := 0 to Document.all.length -1 do
  begin
    Element := Document.all.item(I, 0) as IHTMLElement;
    If Element.tagName = '!' Then
    begin
      Node := Element as IHTMLDomNode;
      If (Node <> nil) and (Pos('CTYPE', UpperCase(Node.nodeValue)) > 0) Then
      begin
        S := VarToStr(Node.nodeValue);  { don't change case of result }
        if Copy(Uppercase(S), 1, 5) = 'CTYPE' then
          S := 'DO' + S;
        if Copy(Uppercase(S), 1, 7) = 'DOCTYPE' then
          S := '<!' + S;
        if Uppercase(S) = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//E' then
          S := S +'N">';

        if EndStr(Lowercase(S), 3) = '.dt' then
          S := S + 'd"';
        if EndStr(Lowercase(S), 5) = '.dtd"' then
          S := S + '>';

        Result := Result + S;
      end;
    end
    Else
      Result := Result + Element.outerHTML;

    If Element.tagName = 'HTML' Then
      Break;
  end;
  Charset := Document.charset;
end;

So call WebBrowser.Navigate(URL), then retrieve the HTML source in the OnDocumentComplete event.

However, with your URL you will see that the OnDocumentComplete event fires twice :(, so you need to get the HTML from the last firing.

You can refer to this post How do I avoid the OnDocumentComplete event for embedded iframe elements? for info on how to get the final OnDocumentComplete event. However, I tried it and it was not working for me. You may need to use some other strategy to get the last event.
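
One possible "other strategy", given here only as a minimal sketch and not as something tested against this site: since the goal is the <table> section, the handler can read the DOM source on every firing and keep only a result that already contains a table tag. The form class TForm1, the component name WebBrowser1 and the field FPageHtml are hypothetical names, and the handler signature is the one recent Delphi versions generate (older versions declare URL as a var parameter), so it is safest to let the IDE create the handler stub.

```delphi
{ Sketch only. Assumes a VCL form TForm1 that owns a TWebBrowser named
  WebBrowser1 and declares a private field FPageHtml: String, and that
  GetHTMLDocumentSource (shown above) is in scope.
  Units needed: SysUtils, Variants, SHDocVw, MSHTML. }
procedure TForm1.WebBrowser1DocumentComplete(ASender: TObject;
  const pDisp: IDispatch; const URL: OleVariant);
var
  Html: String;
  Charset: String;
begin
  { Nothing to read if no document has been loaded yet. }
  if WebBrowser1.Document = nil then
    Exit;

  Html := GetHTMLDocumentSource(WebBrowser1, Charset);

  { The first firing still carries the challenge script only, so accept
    the document only once it actually contains a table tag. }
  if Pos('<table', LowerCase(Html)) > 0 then
    FPageHtml := Html;
end;
```

With this approach the number of firings does not matter: FPageHtml stays empty until a document containing the table arrives. A page whose initial response already contained some unrelated table would need a more specific check.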

Not sure of your needs, but you may also optimize this process by disabling WebBrowser from downloading images. I believe that is possible.

Indeed, it fires twice; the second time gives the right result. – nostriel Oct 14 '16 at 16:11
