
If the string is

<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>.  Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="test.pdf">

The regex I have written is

/href=.*?.pdf/

This matches from the first 'href' all the way through to '.pdf', so both links end up in the match. I need the match to start at the second href instead. In other words, it should capture only the href that ends with .pdf.

How should I go about this using regex?

dudemanbearpig

**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Sep 10 '13 at 16:49

2 Answers


You can try this regex:

/href=[^>]+\.pdf/


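Most of the time, it's better to avoid .* and .+ (and their lazy versions) when you can :) Here the negated class [^>]+ cannot run past the closing > of a tag, so a match that starts at the first href fails and the engine moves on to the second one.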

Also, don't forget to escape periods: an unescaped . matches any character, not just a literal dot.
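
To make the difference concrete, here is a minimal sketch that runs both patterns against the string from the question. The third pattern, with a capture group, is an addition of mine and not part of the answer above; it assumes the attribute value is double-quoted.

```php
<?php
// The sample string from the question (single-quoted, so the inner double quotes need no escaping).
$html = '<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>.  Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="test.pdf">';

// Original pattern: the engine starts at the leftmost "href=", and the lazy
// quantifier keeps expanding until "pdf" becomes reachable, so the match spans both links.
preg_match('/href=.*?.pdf/', $html, $lazy);

// Suggested pattern: [^>]+ cannot cross the ">" that closes the first tag,
// so the attempt at the first href fails and the second href is matched instead.
preg_match('/href=[^>]+\.pdf/', $html, $bounded);

// Assumption (not from the answer): capture only the URL of a double-quoted attribute.
preg_match('/href="([^">]+\.pdf)"/', $html, $url);

echo $lazy[0], PHP_EOL;     // starts at the first href, ends at test.pdf
echo $bounded[0], PHP_EOL;  // href="test.pdf
echo $url[1], PHP_EOL;      // test.pdf
```

Note that the bounded match still includes the leading `href="`; the capture group is one way to get at the bare URL if that is what you need.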

Jerry

You should use a DOM parser rather than a regex to parse HTML or XML. In PHP, the DOMDocument class is available for this:

// Parse the HTML fragment into a DOM tree
$doc = new DOMDocument();
$doc->loadHTML('<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>.  Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="http://www.police.vt.edu/VTPD_v2.1/crime_stats/crime_logs/data/VT_2011-01_Crime_Log.pdf">');

// Print the href of every <a> element, one per line
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
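
The loop above prints every href. Since the question asks only for the link ending in .pdf, a filtering step might look like the sketch below. It uses the string from the question; the `$pdfLinks` variable and the extension check via `parse_url` and `pathinfo` are my own choices for illustration, not part of the original answer.

```php
<?php
// The sample string from the question.
$html = '<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>.  Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="test.pdf">';

// Collect libxml parse warnings internally instead of printing them;
// the input is a fragment, not a complete well-formed document.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$pdfLinks = []; // hypothetical name: hrefs whose path ends in .pdf
foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // Look only at the path component so a query string does not hide the extension.
    $path = parse_url($href, PHP_URL_PATH);
    if (is_string($path) && strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'pdf') {
        $pdfLinks[] = $href;
    }
}

print_r($pdfLinks); // contains only "test.pdf" for this input
```

Comparing the extension case-insensitively means a link such as `REPORT.PDF` would also be kept; drop the `strtolower` call if that is not wanted.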
hek2mgl