
If I have texts containing:

<h1> Test </h1>
<some html elements>
<a href="www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz"></a>
<more html elements>

How can I preg_match by matching a word containing "abc=xxxx" so that I get:

www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz

1 Answer

As you are searching for a URL here, it's worth making clear what sets it apart from its context.

A URL in general does not contain whitespace, and often it is enclosed in some kind of quoting or brackets that make it easy to spot:

 URL   -- surrounded by whitespace --
"URL"  -- quoted like in your example --
<URL>  -- the classic way of marking a URL --

That allows describing the URL with the following expression:

~(?P<url>[^\s<>"\']+)~

Running this alone on the document in your example already gets some work done: it gives 13 results, of which 12 are false positives, but the URL is among them:

#1 h1,            #2 Test,           #3 /h1,
#4 some,          #5 html,           #6 elements,
#7 a,             #8 href=,        

#9 www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz,

#10 /a,           #11 more,          #12 html,
#13 elements.
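This quick check can be reproduced with preg_match_all on the example HTML from the question:

```php
<?php
// The example HTML from the question.
$input = <<<HTML
<h1> Test </h1>
<some html elements>
<a href="www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz"></a>
<more html elements>
HTML;

// Match every run of characters that is not whitespace, <, >, " or '.
preg_match_all('~(?P<url>[^\s<>"\']+)~', $input, $matches);

echo count($matches['url']), "\n"; // 13 candidates, #9 is the URL
echo $matches['url'][8], "\n";
```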

Luckily you have even more criteria for what a URL is in your case, so these can be added. For example, a query string must be there. That is, the URL must contain a question mark:

~(?P<url>[^\s<>"\'?]+\?[^\s<>"\'?]+)~

The question mark has been excluded from the allowed character group, the group has been split in two and the question mark is now required in the middle. As a URL can only contain it once, this is perfectly fine.

And now there is only one match left.

www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz
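This can be verified quickly (a sketch, reusing just the anchor line from the example HTML):

```php
<?php
$input = '<a href="www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz"></a>';

// The question mark is excluded from both character groups and is
// required exactly once in the middle.
preg_match_all('~(?P<url>[^\s<>"\'?]+\?[^\s<>"\'?]+)~', $input, $matches);

echo count($matches['url']), "\n"; // 1
echo $matches['url'][0], "\n";     // the full URL
```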

As this is now really hard to read, let's write it down more readably:

~
    (?(DEFINE)
        (?<Chars> [^\s<>"\'?]+)
    )

    (?P<url> (?&Chars) \? (?&Chars) )
~x
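The spelled-out pattern behaves exactly like the compact one; in PHP this looks like the following (the x modifier makes whitespace outside character classes insignificant):

```php
<?php
$input = '<a href="www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz"></a>';

// Same pattern as before, written with a (?(DEFINE)...) block and
// the x (extended) modifier for readability.
$pattern = '~
    (?(DEFINE)
        (?<Chars> [^\s<>"\'?]+ )
    )

    (?P<url> (?&Chars) \? (?&Chars) )
~x';

preg_match($pattern, $input, $match);
echo $match['url'], "\n"; // www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz
```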

And that's still not the end, as you clearly know what you're looking for: that abc=(.*?)& part. It's a little off: the value is terminated by &, so it can't contain one. As with the question mark, this should be put into its own pattern. And since such a value can be at the end of the URL as well, the rest following it can be made optional:

~
    (?(DEFINE)
        (?<Chars> [^\s<>"\'?]+ )
        (?<Val> [^\s<>"\'?&]* )
    )

    (?P<url>   (?&Chars) \? (?&Chars)? abc = (?P<value> (?&Val) ) &? (?&Chars)?  )
~x
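Running this against the example URL, where abc is the first query parameter, the named groups give both the whole URL and its value:

```php
<?php
$input = '<a href="www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz"></a>';

$pattern = '~
    (?(DEFINE)
        (?<Chars> [^\s<>"\'?]+ )
        (?<Val>   [^\s<>"\'?&]* )
    )

    (?P<url> (?&Chars) \? (?&Chars)? abc = (?P<value> (?&Val) ) &? (?&Chars)? )
~x';

preg_match($pattern, $input, $match);
echo $match['url'], "\n";   // www.example.com/test?abc=xxxx&def=yyyy&ghi=zzzzz
echo $match['value'], "\n"; // xxxx
```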

So as long as you're interested in a specific URL, it's relatively simple to do this with a regular expression. But beware: the URL in the document might not be normalized, and similar problems can occur. It's therefore usually worthwhile to normalize URLs first and then proceed with them, for example when looking for URL parameters in the query part.
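To see why normalization matters, consider a percent-encoded parameter name. PHP's stdlib can decode it without any extra library (a sketch):

```php
<?php
// "%61%62%63" is the percent-encoded form of "abc": a plain
// strpos($url, 'abc=') check would miss it.
$url = 'www.example.com/test?%61%62%63=xxxx&def=yyyy';

// parse_url extracts the query string; parse_str percent-decodes
// both parameter names and values.
parse_str((string) parse_url($url, PHP_URL_QUERY), $params);

var_dump(isset($params['abc'])); // bool(true)
echo $params['abc'], "\n";       // xxxx
```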

While writing this, I actually think the filtering of the URLs you're getting from the document should be independent of the parsing method. As other users have commented, you may want to use an HTML parser instead of regular expressions. Or perhaps you want both.

Let's first take care of the regular expression scenario. Here is a regex variant with proper URL parsing. As a precaution, the length of URL candidates has been limited to between 6 and 256 bytes:

$matcher  = new PregStringMatcher('~([^\s<>"\']{6,256})~');
$segments = new StringMatcherIterator($matcher, $input);
$all      = new DecoratingIterator($segments, 'Net_URL2');
$urls     = new CallbackFilterIterator($all, function (Net_URL2 $url) {
    return isset($url->getQueryVariables()['abc']);
});

foreach ($urls as $url) {
    echo $url->getQueryVariables()['abc'], ' - ', $url, "\n";
}

This code makes use of classes from IteratorGarden and PEAR's Net_URL2. The output is (I modified your example HTML a little):

xxxx - www.example.com/test?%61%62%63=xxxx&def=yyyy&ghi=zzzzz
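If you don't have IteratorGarden or Net_URL2 at hand, the same filtering idea can be sketched in plain PHP, with parse_url/parse_str standing in for Net_URL2's query parsing (an assumption, not the original code):

```php
<?php
$input = '<a href="www.example.com/test?%61%62%63=xxxx&def=yyyy&ghi=zzzzz"></a>';

// 1. Collect URL candidates of 6 to 256 bytes.
preg_match_all('~([^\s<>"\']{6,256})~', $input, $matches);

// 2. Keep only candidates whose decoded query string has an "abc" parameter.
foreach ($matches[1] as $candidate) {
    parse_str((string) parse_url($candidate, PHP_URL_QUERY), $params);
    if (isset($params['abc'])) {
        echo $params['abc'], ' - ', $candidate, "\n";
    }
}
```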

If you now consider switching over to the HTML parser, you won't need to change much of that code. As the filtering logic remains the same, all you need to do is exchange the underlying Traversable:

$doc   = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$doc->loadHTML($input);
libxml_use_internal_errors($saved);

$attributes = (new DOMXPath($doc))->query('//@href');
$segments   = new DecoratingIterator($attributes, function (DOMAttr $attr) {
    return $attr->nodeValue;
});

The rest of the code can remain the same, and the result in this case is the same. I hope these examples are useful and show some ways to deal with the regex pattern as well as how to add more checks.

Here is the code example in full, with both the regex and the HTML parser. The URL filter is the same in both:

<?php
/**
 * preg_match: return entire url by matching a word inside it
 *
 * @link http://stackoverflow.com/a/29481904/367456
 */

require __DIR__ . '/vendor/autoload.php';

$input = <<<BUFFER
<h1> Test </h1>
<some html elements>
<a href="www.example.com/test?%61%62%63=xxxx&def=yyyy&ghi=zzzzz"></a>

<more html elements>
BUFFER;

// Regex based retrieval

$matcher  = new PregStringMatcher('~([^\s<>"\']{6,256})~');
$segments = new StringMatcherIterator($matcher, $input);
$all  = new DecoratingIterator($segments, 'Net_URL2');
$urls = new CallbackFilterIterator($all, function (Net_URL2 $url) {
    return isset($url->getQueryVariables()['abc']);
});

foreach ($urls as $url) {
    echo $url->getQueryVariables()['abc'], ' - ', $url, "\n";
}

// DOMDocument based retrieval

$doc   = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$doc->loadHTML($input);
libxml_use_internal_errors($saved);

$attributes = (new DOMXPath($doc))->query('//@href');
$segments   = new DecoratingIterator($attributes, function (DOMAttr $attr) {
    return $attr->nodeValue;
});
$all  = new DecoratingIterator($segments, 'Net_URL2');
$urls = new CallbackFilterIterator($all, function (Net_URL2 $url) {
    return isset($url->getQueryVariables()['abc']);
});

foreach ($urls as $url) {
    echo $url->getQueryVariables()['abc'], ' - ', $url, "\n";
}