I am trying to write a regex in PHP that allows me to capture the last instance of an HTML tag right before an instance of another HTML tag.

For example, if I have the following HTML:

<p>Para #1</p><p><a href="/path/to/keyword-here/21">Link Here</a> Para #2</p><p>Para #3</p>

I want to capture just the following, with capturing groups for keyword-here and 21:

<p><a href="/path/to/keyword-here/21">Link Here</a> Para #2</p>

I tried using the following regex, but it ended up getting everything from <p>Para #1 to the </p> after Para #2, which is too much:

'#<p.*?<a .*?(keyword-here)/(\d+).*?</a>.*?</p>#'

Because that didn't work, I then tried adding a negative lookahead as follows, but that causes no matches to be returned at all:

'#<p(?!.*<p).*?<a .*?(keyword-here)/(\d+).*?</a>.*?</p>#'

So now I'm stuck. The first regex captures too much, and the second is too restrictive and matches nothing at all. Where's the middle ground that gets me what I'm after?

What am I missing? Am I close or completely approaching this in the wrong way? Thank you.

HartleySan
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – AbraCadaver Aug 01 '19 at 19:14
  • You should use a proper parser to process HTML; using regex can cause problems. Have a look at DOMDocument instead: combined with XPath, it can make your job a lot easier. – Nigel Ren Aug 01 '19 at 19:17
  • Are these comments meant to imply that doing this with a regex is impossible? If so, please just say so, and I'll use another solution. If a regex can solve the problem, though, I would prefer to go that route. – HartleySan Aug 01 '19 at 19:23
  • Any particular reason you're married to a regex-based solution for this? Using a proper HTML parser is likely the *correct* way to go here, especially considering the requirement that you'll need information on individual elements' context within the larger document. – esqew Aug 01 '19 at 19:31
  • Not married to a regex solution, but I think sometimes there are other considerations to be made beyond just what is asked in a particular question. For example, I didn't want to get into limitations in the current codebase, various time constraints I'm under, my lack of familiarity with DOMDocument, etc., etc. Point being, I don't doubt that a non-regex solution is probably ideal for my problem in a vacuum, but in the real world, which has a lot of other variables at play, maybe a regex solution is ideal. That said, Nigel Ren provided a very simple and clear solution, and I'm fine with that. – HartleySan Aug 01 '19 at 19:58

1 Answer

With DOMDocument and XPath, you can use the following code:

$html = '<p>Para #1</p><p><a href="/path/to/keyword-here/1">Link Here</a><a href="/path/to/keyword-here/21">Link Here</a> Para #2</p><p>Para #3</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xp = new DOMXPath($doc);
$href = $xp->evaluate("string(//p/a[last()]/@href)");
echo $href;

which gives...

/path/to/keyword-here/21

The XPath expression //p/a[last()]/@href looks for <a> elements that are direct children of a <p> element. The [last()] predicate keeps only the last <a> within each <p>, and @href selects its href attribute. Wrapping the expression in string() returns the value of the first matching node as a string.
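The question also asks for the whole <p> element plus separate captures for keyword-here and 21. The following is only a minimal sketch of one way to get those with the same DOM approach. The contains() predicate, the href pattern, and the variable names ($paragraphHtml, $keyword, $id) are my assumptions, not something from the question.

```php
<?php
// Sketch only: the XPath predicate and the href pattern are assumptions.
$html = '<p>Para #1</p><p><a href="/path/to/keyword-here/1">Link Here</a><a href="/path/to/keyword-here/21">Link Here</a> Para #2</p><p>Para #3</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xp = new DOMXPath($doc);

// Select every <p> that has a direct <a> child whose href contains the keyword.
$paragraphs = $xp->query('//p[a[contains(@href, "keyword-here")]]');

$paragraphHtml = null;
$keyword = null;
$id = null;

if ($paragraphs->length > 0) {
    $p = $paragraphs->item(0);

    // Serialise just this <p> element back to HTML.
    $paragraphHtml = $doc->saveHTML($p);

    // Read the href of the last <a> child, relative to this <p>.
    $href = $xp->evaluate('string(a[last()]/@href)', $p);

    // Split the href into the two pieces the question wants as groups.
    if (preg_match('#/(keyword-here)/(\d+)$#', $href, $m)) {
        $keyword = $m[1];
        $id = $m[2];
    }
}

echo $paragraphHtml, "\n", $keyword, "\n", $id, "\n";
```

Here the regex only has to parse a single attribute value, which is far less fragile than matching across tags.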

Note that I updated the HTML to include a new first <a> tag with /path/to/keyword-here/1 as the href, but the code still returns /path/to/keyword-here/21.
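If a regex really is required, here is why the two attempts in the question behave as they do. The first pattern starts matching at the leftmost <p, and a lazy .*? will still run past </p> when that is the only way to reach <a. The second pattern's (?!.*<p) demands that no <p appears anywhere later in the string, which is only true for the last paragraph, and that one has no link. Below is a rough sketch of a middle ground that forbids crossing a paragraph boundary one character at a time. It uses the question's original HTML, the $notP helper is a name I made up, and like any regex over HTML it assumes simple, non-nested markup.

```php
<?php
// Sketch only: assumes flat, well-formed <p> elements with no nesting.
$html = '<p>Para #1</p><p><a href="/path/to/keyword-here/21">Link Here</a> Para #2</p><p>Para #3</p>';

// Match any run of characters that does not step onto "<p" or "</p".
$notP = '(?:(?!</?p\b).)*?';

$pattern = '#<p\b[^>]*>' . $notP
         . '<a\s[^>]*?/(keyword-here)/(\d+)[^>]*>' . $notP . '</a>'
         . $notP . '</p>#s';

$matched = preg_match($pattern, $html, $m);

if ($matched === 1) {
    echo $m[0], "\n"; // the whole paragraph
    echo $m[1], "\n"; // the keyword
    echo $m[2], "\n"; // the number
}
```

With the sample input this should match only the second paragraph, but the DOMDocument version is still the safer choice for anything beyond tightly controlled markup.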

Nigel Ren