I made the following regex using RegexPal: <a.*?href="([^"]*)"
and I am calling it using the built-in XE4 unit "RegularExpressions":
Matches := TRegex.Create('<a.*?href="([^"]*)"').Matches(PageSource);
The goal is to extract all the links on the page from the HTML source, so that I can show them in a TListView without using the TWebView component.
As a first example, the link "http://www.splitbrain.org/_static/ico/farmfresh/" has a page source of about 142KB. When running the code below, peak memory hits about 530MB. Quite hefty, but it works, and it comes up with a match list of about 1400 items.
As a second example, the link "http://www.splitbrain.org/_static/ico/fugue/" has a page source of about 338KB. When running the code below, peak memory hits about 1.7GB before throwing an "out of memory" exception. Clearly the straightforward solution is not going to work for larger pages.
I realize I could read the page source line by line, and analyze each line using the regex. I suspect this may have a performance impact, but at least the peak memory should be a lot lower.
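Something like the following is what I have in mind for the line-by-line approach (just a sketch; ExtractLinksLineByLine and the Links parameter are made-up names, and any <a ... href="..."> that happens to be split across two lines would be missed):

uses
  System.Classes, System.RegularExpressions;

procedure ExtractLinksLineByLine(const PageSource: string; Links: TStrings);
var
  Lines: TStringList;
  Regex: TRegEx;
  Line: string;
  Match: TMatch;
begin
  Regex := TRegEx.Create('<a.*?href="([^"]*)"');
  Lines := TStringList.Create;
  try
    Lines.Text := PageSource;        // split the downloaded source into lines
    for Line in Lines do
    begin
      Match := Regex.Match(Line);    // only one line is scanned at a time
      while Match.Success do
      begin
        Links.Add(Match.Groups[1].Value);
        Match := Match.NextMatch;
      end;
    end;
  finally
    Lines.Free;
  end;
end;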
I was wondering, is TRegex really suited for analyzing this kind of data? I noticed several reports about TRegex having unresolved bugs. (Sorry, I'd quote a direct source, but I'm still limited to 2 links. Long time reader, first time poster as of today.)
If not (as appears to be the case), what would be the best bet for speed/performance and lower peak memory usage? I found that PCRE may be an option, but if possible I'd like to limit external libraries as much as possible. If I were to include PCRE, could it be dropped in with minimal code changes? (e.g. is the regex I use compatible?)
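From what I've read, TRegEx in XE4 is itself a wrapper around PCRE via TPerlRegEx in System.RegularExpressionsCore, so one idea I'm considering is to use that class directly and walk the matches one at a time instead of building a whole TMatchCollection. A rough sketch of what I mean (ExtractLinksWithPerlRegEx and the Links parameter are names I made up; depending on the XE4 RTL, Subject/Groups may be UTF8String, so implicit string conversions could be involved):

uses
  System.Classes, System.RegularExpressionsCore;

procedure ExtractLinksWithPerlRegEx(const PageSource: string; Links: TStrings);
var
  Regex: TPerlRegEx;
begin
  Regex := TPerlRegEx.Create;
  try
    Regex.RegEx := '<a.*?href="([^"]*)"';
    Regex.Subject := PageSource;
    // Iterate match by match instead of materializing a collection.
    if Regex.Match then
      repeat
        Links.Add(Regex.Groups[1]); // capture group 1 holds the href value
      until not Regex.MatchAgain;
  finally
    Regex.Free;
  end;
end;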
Sample code:
function TFrmMain.FGetURLSourceAsString(const aURL: string; Depth: Integer): string;
var
  Matches: TMatchCollection;
  Url: String;
begin
  Url := aURL; // work on a local copy, since it may be replaced by a meta-refresh target below
  // Set UserAgent. This is needed to prevent the following error: "HTTP/1.1 403 Forbidden."
  lHTTP.Request.UserAgent := 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon)';
  // Unhandled redirects will cause a 301 error. See: http://stackoverflow.com/questions/4549809/indy-idhttp-how-to-handle-page-redirects
  lHTTP.HandleRedirects := True;
  lHTTP.RedirectMaximum := 35;
  // TODO: we don't actually support https yet, it needs an IOHandler.
  // If the url has no http in front, add it. Otherwise Indy will complain about "unknown protocol".
  if AnsiPos('http', Url) = 0 then
    Result := lHTTP.Get('http://' + Url)
  else
    Result := lHTTP.Get(Url);
  // Analyze for possible meta refreshing:
  // Example: <meta http-equiv="refresh" content="1;url=http://urlhere">
  Matches := TRegex.Create('<meta.*?content=.*?url=([^"]*)"').Matches(Result);
  if (Matches.Count > 0) and (Depth < 5) then
  begin
    Url := Matches.Item[0].Groups[1].Value;
    Result := FGetURLSourceAsString(Url, Depth + 1);
  end
  else
  begin
    // if Depth >= 5 then
    //   TODO: message that the maximum depth was reached.
    // Just return Result as is.
  end;
end;
procedure TFrmMain.BtnLinksClick(Sender: TObject);
var
  PageSource: String;
  Matches: TMatchCollection;
begin
  LvResultSpeeds.Clear();
  PageSource := FGetURLSourceAsString(EditURL.Text, 0);
  // TODO: this quickly jumps to 1.7 GB memory usage on the splitbrain url
  Matches := TRegex.Create('<a.*?href="([^"]*)"').Matches(PageSource);
end;