I have a complete HTML document string from a web page containing this BASE tag:

<BASE href="http://whatreallyhappened.com/">

In Delphi XE2, I use this regular expression with the whole HTML document as subject to get the URL from the BASE tag between the double quotes:

BaseURL := TRegEx.Match(HTMLDocStr, '(?<=<base(\s)href=").*(?=")', [roIgnoreCase]).Value;

This works, but only if there is only ONE space character in the subject between BASE and href.

I tried to add a quantifier to the space part in the regex (\s), but it did not work.

So how can I make this regex match the URL even if there are several spaces between BASE and href?

RRUZ
user1580348
  • Don't parse HTML (or XML) with regular expressions. For a number of reasons why not, start [here](http://stackoverflow.com/q/701166). – Ken White Aug 10 '14 at 02:00
  • Why do you need lookarounds for this simple task? – CSᵠ Aug 10 '14 at 02:49
  • regex: ` – CSᵠ Aug 10 '14 at 02:55
  • @KenWhite I tried using IHTMLDocument2 in Delphi XE2, but there seems to be no documentation for using IHTMLDocument2 in Delphi (intellisense does not work with the IHTMLDocument2 namespace in Delphi). What I would need is a rich and easy to use Delphi library wrapping IHTMLDocument2 or whatever existing reliable HTML parsing standard which gives me easy functions like: `GetSpecificAttributeFromSpecificTag(ASpecificTag, ASpecificAttribute: string)` or methods like: `InsertHTMLRightAfterBodyOpeningTag(AHTML: string)`. – user1580348 Aug 10 '14 at 09:26
  • The documentation is at MSDN, under the IHTMLDocument2 interface. IHTMLDocument is part of Windows itself; TWebBrowser is simply a wrapper around it. For access to the full interface, import the type library (Component->Import Component, choose Microsoft HTML Object Library), which will create MSHTML_TLB.pas. [MSDN](http://msdn.microsoft.com) has full documentation of all interfaces that are available. – Ken White Aug 10 '14 at 14:15

2 Answers

2

You're making this far too complicated by using lookaround. If you want to extract only part of the regex match, simply add a capturing group. Then you can use the text matched by the capturing group instead of the overall match. In most cases you'll also get much better performance this way.

To find the base tag in a file and extract its URL you can use the regex `<base[^>]+href=["']([^"']*)["']`. Call `TRegEx.Match()` to get a `TMatch`. This has a `Groups` property that you can use to retrieve group 1 if a match was found.
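
As a minimal sketch of the capturing-group approach described above, assuming the `System.RegularExpressions` unit that ships with Delphi XE2: the function name `ExtractBaseURL` and the sample HTML string are hypothetical and only serve the illustration. The `roIgnoreCase` option is carried over from the question so that both `<BASE` and `<base` match.

```delphi
program ExtractBaseHref;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

// Hypothetical helper (not from the original post): returns the href value of
// the first <base> tag found in HTMLDocStr, or '' if there is no match.
function ExtractBaseURL(const HTMLDocStr: string): string;
var
  M: TMatch;
begin
  Result := '';
  // Single quotes inside a Delphi string literal are doubled, so the pattern
  // passed to the regex engine is: <base[^>]+href=["']([^"']*)["']
  M := TRegEx.Match(HTMLDocStr, '<base[^>]+href=["'']([^"'']*)["'']',
    [roIgnoreCase]);
  // Groups[0] is the overall match; Groups[1] is the text inside the quotes.
  if M.Success and (M.Groups.Count > 1) then
    Result := M.Groups[1].Value;
end;

begin
  // Sample input with several spaces between BASE and href.
  Writeln(ExtractBaseURL(
    '<html><head><BASE   href="http://whatreallyhappened.com/"></head></html>'));
end.
```

Because `[^>]+` accepts any run of characters that stays inside the tag, the number of spaces between `BASE` and `href` no longer matters, and other attributes may precede `href` as well.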

Jan Goyvaerts

With lookaround

You can apply a quantifier to the whitespace by moving it outside the lookbehind, for example:

(?<=<BASE)\s+href=".*(?=")
(?<=<BASE)\s{0,30}href=".*(?=")


Without lookaround

By the way, if you just want the content of the href attribute, there is no need for lookaround; you can simply use a capturing group:

<BASE\s+href="(.*?)"


EDIT: after reading your comments I figured out a workaround (ugly but could work). You can try using something like this:

((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")
          ^---notice \s        ^---notice \s\s       ^---notice \s\s\s

I know that this is horrible, but if none of the above works you can try that.

Federico Piazza
  • Unfortunately, this does not work with the Delphi XE2 Regex library (`System.RegularExpressions`). I always get `href="http://whatreallyhappened.com/` or ` – user1580348 Aug 10 '14 at 01:27
  • @user1580348 Lookbehind doesn't support variable-length quantifiers, only fixed-length ones like `{30}`; that's why I moved `\s` outside the lookbehind. Can you use the regex with the `\s` outside the lookbehind? I'm asking because I'm trying to help you. – Federico Piazza Aug 10 '14 at 01:30
  • @user1580348 By the way can't you just use `".*?"` ? – Federico Piazza Aug 10 '14 at 01:34
  • Thank you very much for trying to help me! My question was not phrased well enough, so I rephrased it: `I have a complete HTML document string from a web page containing this BASE tag: `. So when using `".*?"` I would get all strings between double quotes. But I just need the URL from the BASE tag. Sorry. – user1580348 Aug 10 '14 at 01:48
  • @user1580348 I've updated my answer; take a look at the last part... I know that it isn't neat, but it could work – Federico Piazza Aug 10 '14 at 02:43
  • @user1580348 you can add as many `\s` as you need – Federico Piazza Aug 10 '14 at 02:58