
Based on this question: How can I get HTML source code from TWebBrowser

If I run this code on an HTML page with a Unicode code page, the result is gibberish because TStringStream is not Unicode-aware in D7. The page might be UTF-8 encoded or use some other (ANSI) code page.

How can I detect if a TStream/IPersistStreamInit is Unicode/UTF8/Ansi?

How do I always return correct result as WideString for this function?

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;

If I replace TStringStream with TMemoryStream and save the TMemoryStream to a file, the output is fine, whether the page is Unicode, UTF-8 or ANSI. But I always want to return the stream contents as a WideString:

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
var
  // LStream: TStringStream;
  LStream: TMemoryStream;
  Stream : IStream;
  LPersistStreamInit : IPersistStreamInit;
begin
  if not Assigned(WebBrowser.Document) then exit;
  // LStream := TStringStream.Create('');
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream,soReference);
    LPersistStreamInit.Save(Stream,true);
    // result := LStream.DataString;
    LStream.SaveToFile('c:\test\test.txt'); // test only - file is ok
    Result := ??? // WideString
  finally
    LStream.Free();
  end;
end;

EDIT: I found this article - How to load and save documents in TWebBrowser in a Delphi-like way

It does exactly what I need, but it works correctly only with Unicode-enabled Delphi compilers (D2009+). Read its Conclusion section:

There is obviously a lot more we could do. A couple of things immediately spring to mind. We retro-fit some of the Unicode functionality and support for non-ANSI encodings to the pre-Unicode compiler code. The present code when compiled with anything earlier than Delphi 2009 will not save document content to strings correctly if the document character set is not ANSI.

The magic is obviously in the TEncoding class (TEncoding.GetBufferEncoding), but D7 does not have TEncoding. Any ideas?

  • Maybe this would help http://msdn.microsoft.com/en-us/library/jj160620(v=vs.85).aspx – Sir Rufo Jan 10 '13 at 23:38
  • Try a Unicode-enabled StringList. The jcl.sf.net library has `TWideStringList` and `TJclWideStringList`, TNT Unicode Components has `TWideStringList`, and I think there are more. Perhaps some of them have a COM IStringList adapter as well, one way or another. Try those, or search for more Unicode StringList implementations for Delphi 7 on Google, torry.net or some other collection – Arioch 'The Jan 11 '13 at 06:16
  • MSIE introduced DOM properties such as .outerHTML and .innerHTML. So I bet you have to get to the HTML DOM tree, get the HTML tag, and then read its outerHTML property as a BSTR (aka WideString) without intermediate COM objects. Perhaps you would need a little JavaScript for that. Search for topics like "how to click button in TWebControl": they would provide examples of how to locate a tag as a JS object from the Delphi side and how to call its methods/properties. You would need to read the outerHTML property of the root HTML tag – Arioch 'The Jan 11 '13 at 06:20
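
The outerHTML route from the last comment can be sketched without any JavaScript, assuming the MSHTML import unit that ships with Delphi is available. The unit name WebBrowserOuterHtml and the function name GetWebBrowserOuterHTML are made up for this illustration. Note that this returns the markup as MSHTML re-serializes the DOM, which can differ from the original source the server sent (and typically omits the DOCTYPE).

```delphi
unit WebBrowserOuterHtml; // hypothetical unit name, for illustration only

interface

uses
  SysUtils, SHDocVw, MSHTML;

// Returns the DOM serialization of the loaded page as a WideString.
function GetWebBrowserOuterHTML(WebBrowser: TWebBrowser): WideString;

implementation

function GetWebBrowserOuterHTML(WebBrowser: TWebBrowser): WideString;
var
  Doc3: IHTMLDocument3;
  Root: IHTMLElement;
begin
  Result := '';
  if not Assigned(WebBrowser.Document) then Exit;
  // The document may not be an HTML document at all, so query softly
  // instead of using a hard "as" cast.
  if not Supports(WebBrowser.Document, IHTMLDocument3, Doc3) then Exit;
  // documentElement is the root <html> element.
  Root := Doc3.documentElement;
  if Assigned(Root) then
    Result := Root.outerHTML; // BSTR, already UTF-16, no decoding needed
end;

end.
```

Because outerHTML is already a BSTR, no code page detection is involved. The trade-off is that you get the parsed DOM rather than the bytes of the original file.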

1 Answer

I used GpTextStream to handle the conversion (it should work for all Delphi versions):

function GetCodePageFromHTMLCharSet(Charset: WideString): Word;
const
  WIN_CHARSET = 'windows-';
  ISO_CHARSET = 'iso-';
var
  S: string;
begin
  Result := 0;
  if Charset = 'unicode' then
    Result := CP_UNICODE else
  if Charset = 'utf-8' then
    Result := CP_UTF8 else
  if Pos(WIN_CHARSET, Charset) <> 0 then
  begin
    S := Copy(Charset, Length(WIN_CHARSET) + 1, Maxint);
    Result := StrToIntDef(S, 0);
  end else
  if Pos(ISO_CHARSET, Charset) <> 0 then // ISO-8859 (e.g. iso-8859-1: => 28591)
  begin
    S := Copy(Charset, Length(ISO_CHARSET) + 1, Maxint);
    S := Copy(S, Pos('-', S) + 1, 2);
    if S = '15' then // ISO-8859-15 (Latin 9)
      Result := 28605
    else
      Result := StrToIntDef('2859' + S, 0);
  end;
end;

// Units needed: Windows, SysUtils, Classes, ActiveX, SHDocVw, MSHTML, GpTextStream
function GetWebBrowserHTML(WebBrowser: TWebBrowser): WideString;
var
  LStream: TMemoryStream;
  Stream: IStream;
  LPersistStreamInit: IPersistStreamInit;
  TextStream: TGpTextStream;
  Charset: WideString;
  Buf: WideString;
  CodePage: Word;
  N: Integer;
begin
  Result := ''; 
  if not Assigned(WebBrowser.Document) then Exit;
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream, soReference);
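    if Failed(LPersistStreamInit.Save(Stream, True)) then Exit;
    // Ask the document which charset it was decoded with, then map it to a code page
    Charset := (WebBrowser.Document as IHTMLDocument2).charset;
    CodePage := GetCodePageFromHTMLCharSet(Charset);
    N := LStream.Size;
    if N = 0 then Exit; // nothing was saved; Buf[1] below needs a non-empty buffer
    // Worst case is one WideChar per source byte
    SetLength(Buf, N);
    TextStream := TGpTextStream.Create(LStream, tsaccRead, [], CodePage);
    try
      // Read returns the number of bytes placed in Buf; convert to a WideChar count
      N := TextStream.Read(Buf[1], N * SizeOf(WideChar)) div SizeOf(WideChar);
      SetLength(Buf, N);
      Result := Buf;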
    finally
      TextStream.Free;
    end;
  finally
    LStream.Free();
  end;
end;
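
If you would rather not add a third-party dependency, the decoding step can probably be done with the Windows API alone. The following is only a sketch under a few assumptions: the unit and function names (HtmlBytesToWide, StreamToWideString) are made up for this example, code page 1200 is treated as UTF-16LE (the Windows identifier for it), and a leading byte-order mark, if the saved stream has one, is allowed to override the code page passed in. A code page of 0 (what GetCodePageFromHTMLCharSet returns for an unknown charset) is CP_ACP, so the data is then decoded with the system ANSI code page.

```delphi
unit HtmlBytesToWide; // hypothetical unit name, for illustration only

interface

uses
  Windows, Classes;

// Decodes the raw bytes held in Stream into a WideString using CodePage.
function StreamToWideString(Stream: TMemoryStream; CodePage: Word): WideString;

implementation

const
  CP_UTF16LE = 1200; // Windows code page identifier for UTF-16LE

function StreamToWideString(Stream: TMemoryStream; CodePage: Word): WideString;
var
  P: PAnsiChar;
  Size, Len: Integer;
begin
  Result := '';
  P := Stream.Memory;
  Size := Stream.Size;
  if (P = nil) or (Size = 0) then Exit;

  // Assumption: a byte-order mark, when present, is more reliable than
  // the declared charset, so it overrides CodePage and is skipped.
  if (Size >= 3) and (P[0] = #$EF) and (P[1] = #$BB) and (P[2] = #$BF) then
  begin
    Inc(P, 3);
    Dec(Size, 3);
    CodePage := CP_UTF8;
  end
  else if (Size >= 2) and (P[0] = #$FF) and (P[1] = #$FE) then
  begin
    Inc(P, 2);
    Dec(Size, 2);
    CodePage := CP_UTF16LE;
  end;
  if Size = 0 then Exit;

  if CodePage = CP_UTF16LE then
  begin
    // Already UTF-16: copy the bytes straight into the WideString.
    SetLength(Result, Size div SizeOf(WideChar));
    if Result <> '' then
      Move(P^, Result[1], Length(Result) * SizeOf(WideChar));
    Exit;
  end;

  // First call asks for the required length in WideChars,
  // second call performs the conversion.
  Len := MultiByteToWideChar(CodePage, 0, P, Size, nil, 0);
  if Len <= 0 then Exit; // invalid code page or undecodable data
  SetLength(Result, Len);
  MultiByteToWideChar(CodePage, 0, P, Size, PWideChar(Result), Len);
end;

end.
```

With this in place, the TGpTextStream part of GetWebBrowserHTML could be replaced by something like Result := StreamToWideString(LStream, GetCodePageFromHTMLCharSet(Charset)); provided the constant used for 'unicode' is 1200. I have not tested this against every charset MSHTML can report.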