How do I get the second to last HTML link in PHP regex?

Question

Ok so I'm just now starting to play around with preg_match in PHP and I've run into a problem. I am pulling information from an RSS feed which is giving me a description. In the description I want to pull the web address from the second to last link in the description. There may be any number of links, but the link I need is always second to last. However, when I use this code it pulls all of the links in the description:

preg_match('#<a href=".*</a>#', $description, $match);

I've tried PREG_OFFSET_CAPTURE, which seems to do the same thing as far as the output is concerned. Everything I have googled is only telling me how to grab the last things in the string.

[Don't use regexes for parsing HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — John Conde, Aug 11 '14 at 15:52
TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — j08691, Aug 11 '14 at 15:53
your `$match` is going to be an array of results. pull off the second last entry, and **MAYBE**, if the pony didn't stomp all over you, it'll actually be the url you're looking for. — Marc B, Aug 11 '14 at 15:55
@JohnConde so what should I use instead in this situation? There is no way to predict the words coming out in this feed, and I'd like to get this link. — Nash, Aug 11 '14 at 16:50
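
One regex-free option, in the spirit of the comments above, is PHP's bundled DOM extension. This is only a minimal sketch, assuming PHP 7.1+ and that the description is an HTML fragment; the sample `$description` and the helper name `secondToLastHref` are made up for illustration:

```php
<?php
// Hypothetical sample; a real feed description will look different.
$description = '<p>Intro <a href="http://example.com/one">One</a>, '
    . '<a href="http://example.com/two">Two</a> and '
    . '<a href="http://example.com/three">Three</a>.</p>';

// Hypothetical helper: returns the href of the second-to-last <a>, or null.
function secondToLastHref(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Feed markup is often sloppy, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = $doc->getElementsByTagName('a');
    if ($links->length < 2) {
        return null; // fewer than two links, so there is no second-to-last one
    }

    return $links->item($links->length - 2)->getAttribute('href');
}

echo secondToLastHref($description), PHP_EOL;
```

Because the parser counts actual `<a>` elements, this does not depend on attribute order or on how the tags are spaced.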

Unihedron · Accepted Answer · 2014-08-11T16:58:13.873

3

Match the last two links and capture the first of them:

preg_match('#.*(<a href=".*?</a>).*?<a href=".*?</a>#', $description, $match);

Your match is in the first capturing group, `$match[1]`.

This works because:

`.*` is greedy. It first consumes everything up to the end of the string, then the engine tries to match the next part of the pattern, `<a href="`. That fails at the end of the string, so the engine backtracks one character at a time until it reaches the last link, which it matches up to `</a>`. It then tries the next part, `.*?<a href="`, but no link follows the last one, so the engine has to backtrack again. Now the capturing group matches the second-to-last link, while the last link is matched outside the group, into the garbage bin.
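
To see the backtracking pay off, here is a small sketch that runs the pattern from above on a made-up description (the sample string and the `$url` variable are assumptions, not part of the question) and then pulls the address out of the captured tag:

```php
<?php
// Hypothetical sample with three links on a single line.
$description = 'Intro <a href="http://example.com/one">One</a>, '
    . '<a href="http://example.com/two">Two</a> and '
    . '<a href="http://example.com/three">Three</a>.';

if (preg_match('#.*(<a href=".*?</a>).*?<a href=".*?</a>#', $description, $match)) {
    // $match[1] holds the whole second-to-last <a> tag.
    echo $match[1], PHP_EOL; // <a href="http://example.com/two">Two</a>

    // A second, small match extracts just the address from that tag.
    if (preg_match('#href="(.*?)"#', $match[1], $url)) {
        echo $url[1], PHP_EOL; // http://example.com/two
    }
}
```

Note that `.` does not match newlines by default, so if the description spans several lines you would probably want the `s` modifier after the closing `#`.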

edited Aug 11 '14 at 16:58

answered Aug 11 '14 at 15:55


Thank you. What does the `.*?` do? – Nash Aug 11 '14 at 17:04
`.*?` matches everything, just like `.*`, except it's "lazier": instead of matching as much as possible (skipping to the end of the string and backtracking), it matches one character at a time until the next part of the pattern can match. When parsing [X]HTML, greedy and lazy quantifiers have to be used carefully to keep your regex running optimally! [Otherwise you'll end up like these poor souls.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Unihedron Aug 11 '14 at 17:07
ok so I can see the difference, what would I put if say I wanted to grab the third from the last link? Would I just add another `.*?<a href=".*?</a>#`? – Nash Aug 11 '14 at 17:32
Without the `#` because that's part of the regex syntax, but yes. – Unihedron Aug 11 '14 at 17:33
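
The "add another `.*?<a href=".*?</a>`" idea from these comments can be sketched as a small helper that repeats the trailing part as many times as needed. The function name `nthLinkFromEnd`, the sample string and the `s` modifier are my own assumptions, not part of the answer:

```php
<?php
// Hypothetical helper: returns the n-th link tag counted from the end
// (n = 1 is the last link, n = 2 the second to last, and so on), or null.
function nthLinkFromEnd(string $html, int $n): ?string
{
    if ($n < 1) {
        return null;
    }

    $link = '<a href=".*?</a>';
    // One captured link, followed by ($n - 1) further links that are
    // matched outside the group and thrown away.
    $pattern = '#.*(' . $link . ')' . str_repeat('.*?' . $link, $n - 1) . '#s';

    return preg_match($pattern, $html, $match) === 1 ? $match[1] : null;
}

// Hypothetical sample description.
$description = 'Intro <a href="http://example.com/one">One</a>, '
    . '<a href="http://example.com/two">Two</a> and '
    . '<a href="http://example.com/three">Three</a>.';

echo nthLinkFromEnd($description, 3), PHP_EOL; // <a href="http://example.com/one">One</a>
```

With `$n = 2` the helper builds the same pattern as the answer (plus the `s` modifier), and it returns null when the description has fewer than `$n` links.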
