I'm just starting to play around with preg_match in PHP and I've run into a problem. I am pulling information from an RSS feed, which gives me a description. From that description I want to pull the web address of the second-to-last link. There may be any number of links, but the link I need is always second to last. However, when I use this code, it pulls all of the links in the description:

preg_match('#<a href=".*</a>#', $description, $match);

I've tried PREG_OFFSET_CAPTURE, which seems to do the same thing as far as the output is concerned. Everything I have googled is only telling me how to grab the last things in the string.

Unihedron
Nash
  • 7
    [Don't use regexes for parsing HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – John Conde Aug 11 '14 at 15:52
  • 6
    TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – j08691 Aug 11 '14 at 15:53
  • your `$match` is going to be an array of results. pull off the second last entry, and **MAYBE**, if the pony didn't stomp all over you, it'll actually be the url you're looking for. – Marc B Aug 11 '14 at 15:55
  • @JohnConde so what should I use instead in this situation? There is no way to predict the words coming out in this feed, and I'd like to get this link. – Nash Aug 11 '14 at 16:50
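
One possible answer to the "what should I use instead" question is PHP's bundled DOM extension. The following is only a sketch under assumptions: `secondToLastHref` is a hypothetical helper name, the description is assumed to be an HTML fragment, and the `dom` extension is assumed to be enabled. The fragment is wrapped in a `<div>` so that an empty description does not reach `loadHTML` as an empty string.

```php
<?php
// Hypothetical helper (not from the thread): returns the href of the
// second-to-last <a> element in an HTML fragment, or null if there are
// fewer than two links.
function secondToLastHref(string $description): ?string
{
    $doc = new DOMDocument();

    // Feed markup is often sloppy; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $description . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = $doc->getElementsByTagName('a');
    $index = $links->length - 2;
    if ($index < 0) {
        return null;
    }

    return $links->item($index)->getAttribute('href');
}
```

Unlike a pattern that expects `<a href="` literally, this approach should not care about attribute order or extra attributes on the tag.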

1 Answer

Grab the last two and extract the first among them, then:

preg_match('#.*(<a href=".*?</a>).*?<a href=".*?</a>#', $description, $match);

Your match is in the first capturing group (`$match[1]`).

This works because:

`.*` is greedy. It moves the match pointer to the end of the string, then tries to match the next part of the pattern: `<a href="`. That fails at the end of the string, so the regex engine backtracks until it reaches the last link tag and matches it up to its `</a>`. It then tries to match the next part: `.*?<a href=`. No link follows the last one, so this fails and the engine backtracks again. As a result, the capturing group ends up matching the second-to-last link tag, while the last tag is matched outside the group and discarded.
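
To see this in action, here is a minimal sketch. The sample `$description` and its example.com URLs are made up for illustration, since the real feed content is not shown. Two things are added to the pattern as assumptions: an inner group around the `href` value, so the address itself is captured, and the `s` modifier, so `.` also matches newlines in case the description spans several lines.

```php
<?php
// Hypothetical sample data; the real feed description is not shown in the question.
$description = 'Intro <a href="http://example.com/one">one</a> text '
             . '<a href="http://example.com/two">two</a> more '
             . '<a href="http://example.com/three">three</a> end';

// Same idea as the pattern above. Group 1 is the whole second-to-last tag,
// group 2 (added here) is just its href value.
$pattern = '#.*(<a href="(.*?)".*?</a>).*?<a href=".*?</a>#s';

$tag = null;
$url = null;
if (preg_match($pattern, $description, $match)) {
    $tag = $match[1]; // <a href="http://example.com/two">two</a>
    $url = $match[2]; // http://example.com/two
}

echo $url, "\n";
```

If the description contains fewer than two links, `preg_match` returns 0 and both variables stay `null`.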

Unihedron
  • Thank you. What does the `.*?` do? – Nash Aug 11 '14 at 17:04
  • `.*?` matches everything, just like `.*`, except it's "lazier": instead of matching as much as possible (skipping to the end of the string), it matches one character at a time until the next part of the pattern can match. When parsing [X]HTML, greedy and lazy matches have to be used carefully to keep your regex running optimally! [Otherwise you'll end up like these poor souls.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Unihedron Aug 11 '14 at 17:07
  • ok so I can see the difference, what would I put if say I wanted to grab the third from the last link? Would I just add another `.*? – Nash Aug 11 '14 at 17:32
  • Without the `#` because that's part of the regex syntax, but yes. – Unihedron Aug 11 '14 at 17:33
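
The two points from these comments can be sketched together: the greedy versus lazy difference, and repeating the trailing `.*?<a href=".*?</a>` part to reach further back from the end. The helper name `nthLastHref`, the sample strings, and the `s` modifier are assumptions made for illustration and are not part of the original answer.

```php
<?php
// Greedy vs. lazy on a made-up two-link string.
$html = '<a href="a">A</a> and <a href="b">B</a>';
preg_match('#<a href=".*</a>#', $html, $greedy); // $greedy[0] spans both links
preg_match('#<a href=".*?</a>#', $html, $lazy);  // $lazy[0] is only the first link

// Hypothetical helper: href of the $n-th link counted from the end
// (1 = last, 2 = second to last, 3 = third from last), or null if not found.
function nthLastHref(string $html, int $n): ?string
{
    if ($n < 1) {
        return null;
    }
    // One captured link, followed by ($n - 1) further links that are
    // matched but not captured.
    $pattern = '#.*<a href="(.*?)".*?</a>'
             . str_repeat('.*?<a href=".*?</a>', $n - 1)
             . '#s';

    return preg_match($pattern, $html, $m) ? $m[1] : null;
}
```

For example, `nthLastHref($description, 3)` builds the pattern with the trailing part repeated twice, which is the "add another one" approach confirmed in the comment above.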