
I made the following regex using RegexPal: <a.*?href="([^"]*)"

And I am calling it using the built-in XE4 unit "RegularExpressions":

  Matches := TRegex.Create('<a.*?href="([^"]*)"').Matches(PageSource);

The goal is to extract all the links on the page from the HTML source, so that I can show them in a TListView without using the TWebBrowser component.

As a first example, the link "http://www.splitbrain.org/_static/ico/farmfresh/" has a page source of about 142 KB. When running the code below, peak memory hits about 530 MB. Quite hefty, but it works, and it comes up with a resulting match list of about 1400 items.

As a second example, the link "http://www.splitbrain.org/_static/ico/fugue/" has a page source of about 338 KB. When running the code below, peak memory hits about 1.7 GB before throwing an "out of memory" exception. Clearly the straightforward approach is not going to work for larger pages.

I realize I could read the page source line by line and analyze each line with the regex. I suspect this may have a performance impact, but at least the peak memory should be a lot lower; see the sketch below.
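
Roughly what I had in mind, as a sketch only: ExtractLinksLineByLine and the way the list view is filled are just illustrative (it assumes the form unit already uses System.Classes and System.RegularExpressions), and it also assumes an anchor tag never spans a line break, which real-world HTML does not guarantee.

procedure TFrmMain.ExtractLinksLineByLine(const PageSource: string);
var
  Lines: TStringList;
  LinkRegex: TRegEx;
  Matches: TMatchCollection;
  i, j: Integer;
begin
  LinkRegex := TRegEx.Create('<a.*?href="([^"]*)"');
  Lines := TStringList.Create;
  try
    // Split the page source into lines and run the regex per line,
    // so only one line's match collection is alive at a time.
    Lines.Text := PageSource;
    for i := 0 to Lines.Count - 1 do
    begin
      Matches := LinkRegex.Matches(Lines[i]);
      for j := 0 to Matches.Count - 1 do
        LvResultSpeeds.Items.Add.Caption := Matches.Item[j].Groups[1].Value;
    end;
  finally
    Lines.Free;
  end;
end;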

I was wondering: is TRegex really suited to analyzing this kind of data? I noticed several reports about TRegex having unresolved bugs. (Sorry, I'd quote a direct source, but I'm still limited to 2 links. Long-time reader, first-time poster as of today.)

If not (as appears to be the case), what would be the best bet for speed/performance and lower peak memory usage? I found that PCRE may be an option, but if possible I'd like to limit external libraries as much as possible. If I were to include PCRE, could this be implemented with minimal code changes (e.g. is the regex I'm using compatible)?

Sample code:

function TFrmMain.FGetURLSourceAsString(const aURL: string; Depth: Integer): string;
var
  Matches: TMatchCollection;
  Url: String;
begin
  // Work on a local copy of the parameter; the meta-refresh handling below may replace it.
  Url := aURL;

  // Set UserAgent. This is needed to prevent the following error: "HTTP/1.1 403 Forbidden."
  lHTTP.Request.UserAgent := 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon)';

  // Unhandled redirects will cause a 301 error. See: http://stackoverflow.com/questions/4549809/indy-idhttp-how-to-handle-page-redirects
  lHTTP.HandleRedirects := True;
  lHTTP.RedirectMaximum := 35;

  //todo: we don't actually support HTTPS yet; it needs an IOHandler
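  // Hypothetical sketch only, not wired up yet: HTTPS would need an SSL
  // IOHandler from Indy's IdSSLOpenSSL unit plus the OpenSSL DLLs, e.g.:
  //   lHTTP.IOHandler := TIdSSLIOHandlerSocketOpenSSL.Create(lHTTP);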

  // If url has no http in the front, add it. Otherwise indy will complain about "unknown protocol".
  if AnsiPos('http', Url) = 0 then
    Result := lHTTP.Get('http://' + Url)
  else
    Result := lHTTP.Get(Url);

  //Analyze for possible meta refreshing:
  //Example: <meta http-equiv="refresh" content="1;url=http://urlhere">
  Matches := TRegex.Create('<meta.*?content=.*?url=([^"]*)"').Matches(Result);
  if (Matches.Count > 0) and (Depth < 5) then begin
    Url := Matches.Item[0].Groups[1].Value;
    Result := FGetURLSourceAsString(Url, Depth+1);
  end else begin
    //if Depth >= 5 then
    //todo message max depth reached
    //Just return Result as is
  end;
end;

procedure TFrmMain.BtnLinksClick(Sender: TObject);
var
  PageSource: String;
  Matches: TMatchCollection;
begin
  LvResultSpeeds.Clear();
  PageSource := FGetURLSourceAsString(EditURL.Text, 0);

  //todo: this quickly jumps to 1.7 GB memory usage on the splitbrain url; see the NextMatch sketch below
  Matches := TRegex.Create('<a.*?href="([^"]*)"').Matches(PageSource);
end;
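
For completeness, one variation I might try to keep peak memory down is to walk the matches one at a time with TMatch.NextMatch instead of materializing the whole TMatchCollection up front. This is only a sketch: ShowLinksLazily is an illustrative name, the way the list view is filled is my assumption, and I have not verified that it actually avoids the memory blow-up.

procedure TFrmMain.ShowLinksLazily(const PageSource: string);
var
  Match: TMatch;
begin
  Match := TRegEx.Match(PageSource, '<a.*?href="([^"]*)"');
  while Match.Success do
  begin
    // Add each captured href as soon as it is found, instead of
    // collecting the full match list first.
    LvResultSpeeds.Items.Add.Caption := Match.Groups[1].Value;
    Match := Match.NextMatch;
  end;
end;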
  • See [Extract Links From an HTML Page Using Delphi](http://delphi.about.com/od/internetintranet/a/extract-links-from-a-html-page-using-delphi.htm). Using lazy dot matching with large input is fraught with timeout and other issues like the one you are having. – Wiktor Stribiżew Jan 01 '16 at 18:31
  • In general, regex cannot analyse html. Why not use a parser? – David Heffernan Jan 01 '16 at 18:45
  • @stribizhev: I modified the code you linked to use a TStringList. It then works for most things that use plain a-hrefs, but not for the examples I linked, likely because it does not support area shapes or image maps. – T.S Jan 01 '16 at 19:26
  • @David, such as the one stribizhev linked? I would contemplate using a TWebBrowser or COM object, if there was one that did the expected task well. TRegex does work, however; memory just seems to be the main issue for larger pages. I shall give this a try later: [link](http://stackoverflow.com/questions/14348346/html-tag-parsing) – T.S Jan 01 '16 at 19:33
  • No, regex cannot be used to parse html: http://stackoverflow.com/a/1732454/505088 – David Heffernan Jan 01 '16 at 19:36
  • @DavidHeffernan Note that I'm not actually trying to "parse" the HTML; I just wish to grab a list of all the links to give the user something he might click and browse. Which should simplify the issue, or perhaps complicate it in this case. – T.S Jan 01 '16 at 19:41
  • That's parsing. You don't have to believe me. – David Heffernan Jan 01 '16 at 20:06
  • As David says, extracting the links from the page is parsing. It's much easier to use an actual DOM parser for doing so. As for bugs in TRegEx, I'm a little dubious about that claim. TRegEx is based on TPerlRegEx, which is a freeware library; I've used it for several years before it was included in Delphi without issues. – Ken White Jan 02 '16 at 00:19

0 Answers