I'm looking for a preg_match_all pattern to find all URLs on a page that don't have a trailing slash.

For example: if I have

  1. <a href="/testing/abc/">end with slash</a>

  2. <a href="/testing/test/mnl">no ending slash</a>

The result would be #2

Thanks.

user2170712

2 Answers

1

It is better to extract all href values with a DOM parser and then check whether each URL ends with a slash. No regex is needed for that.

For a regex solution that handles the examples provided, you can use:

/href=(['"])[^\s]+(?<!\/)\1/

Live Demo: http://www.rubular.com/r/f2XJ6rF5Fb

Explanation:

href=   -> match the literal text href=
(['"])  -> match a single or double quote and capture it as group #1
[^\s]+  -> match one or more non-whitespace characters
(?<!\/) -> negative lookbehind: succeed only if the preceding character is not /
\1      -> match the closing quote, the same character captured in group #1
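
Since the question asks about preg_match_all, here is a minimal sketch of how the pattern might be used from PHP. It assumes the sample markup from the question. The extra capture group around the URL and the `#` delimiter are additions for illustration and are not part of the pattern above.

```php
<?php
// Sample markup taken from the question.
$html = <<<'HTML'
<a href="/testing/abc/">end with slash</a>
<a href="/testing/test/mnl">no ending slash</a>
HTML;

// Same pattern as above, with two illustrative changes:
//  - '#' is the delimiter, so '/' needs no escaping
//  - a second capture group wraps the URL itself
$pattern = '#href=([\'"])([^\s]+)(?<!/)\1#';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the quote characters, $matches[2] the captured URLs.
print_r($matches[2]);
```

For the two sample links, `$matches[2]` should contain only `/testing/test/mnl`. Because `[^\s]+` is greedy and stops only at whitespace, markup with no space between attributes (for example `href="/a/"class="x"`) can produce a false match, which is one more reason to prefer a DOM parser for real pages.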
anubhava
  • it works! could you please give a brief explanation. thank you very much – user2170712 Mar 14 '13 at 16:49
  • I'm trying to get all href links that have no trailing slash, but exclude links that have 'images' text/path or .pdf in them. I have tried regex lookbehind but had no success. Your suggestion is appreciated. – user2170712 Mar 19 '13 at 16:28
  • Since that looks like a new requirement, can you please create a new question? I will be happy to provide an answer. – anubhava Mar 19 '13 at 16:47
  • here is the new question: http://stackoverflow.com/questions/15505610/regex-pattern-for-url-with-no-ending-slash-and-exclude-certain-text-in-url – user2170712 Mar 19 '13 at 16:52
  • Oh thanks, looks like someone answered your question. Pls let me know if that doesn't solve your problem then I can also find a regex for you. – anubhava Mar 19 '13 at 17:37
0

Indeed, use a DOM parser [why?]. Here's an example:

// let's define some HTML
$html = <<<'HTML'
<html>
<head>
</head>
<body>
    <a href="/testing/abc/">end with slash</a>
    <a href="/testing/test/mnl">no ending slash</a>
</body>
</html>
HTML;

// create a DOMDocument instance (a DOM parser)
$dom = new DOMDocument();
// load the HTML
$dom->loadHTML( $html );

// create a DOMXPath instance, to query the DOM
$xpath = new DOMXPath( $dom );

// find all nodes containing an href attribute, and return the attribute node
$linkNodes = $xpath->query( '//*[@href]/@href' );

// initialize a result array
$result = array();

// iterate all found attribute nodes
foreach( $linkNodes as $linkNode )
{
    // does its value not end with a forward slash?
    if( substr( $linkNode->value, -1 ) !== '/' )
    {
        // add the attribute value to the result array
        $result[] = $linkNode->value;
    }
}

// let's look at the result
var_dump( $result );
Decent Dabbler