For example, I have this custom HTML document:

<html>
<head>
    <title>Urls</title>
</head>
<body>
    <a href="https://www.google.com">Google</a>
    <a href="https://facebook.com">Facebook</a>
    <a href="http://www.example.com">Example</a>

    <p>Duis aute irure dolor in reprehenderit in voluptate velit esse
    cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
    proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>

    <h1>Heading</h1>

    <a href="www.example.com">Example</a>
</body>
</html>

How can I extract from this document the domain names that contain the string example.com?

For example, I have this regex, <a.+?\s*href\s*=\s*["\']?([^"\'\s>]+)["\']?, which finds all URLs in href attributes. But how can I use regex to find a specific URL?

Andreas Hunter
  • #moderators RegEx is stupidly difficult to get right (I never use it). Referencing "How do you parse and process HTML/XML in PHP?" is not a correct related question. URIs and HTML bodies are two different things. There might be a correct related question, but at the moment, I see no reason to close this question – Kind Contributor Jul 06 '20 at 12:24
  • @Todd There are lots of ways to solve the problem expressed here in the related question, and most of them are not regex. I don't understand your concern. – Félix Adriyel Gagnon-Grenier Jul 06 '20 at 13:05
  • @FélixGagnon-Grenier The term "Match" in the title refers to regex, the tag "Regex" is included, and the OP mentions Regex. The poster is not interested in a range of general ways to parse HTML documents; they are interested in a specific Regex way for a particular kind of match. That link to other alternatives might be interesting and useful for the poster, but it isn't what was asked for. – Kind Contributor Jul 06 '20 at 16:42
  • Ok, is it possible the regex present here could answer your question then? [php regex to extract specific domain with/without www http https from href](https://stackoverflow.com/questions/50428461/php-regex-to-extract-specific-domain-with-without-www-http-https-from-href) it even uses an "example.com" example. It's also trying to find the domain in the href. – Félix Adriyel Gagnon-Grenier Jul 10 '20 at 01:39

1 Answer

To reliably extract the href values that contain www.example.com from all elements in the HTML document, I would use a combination of DOMDocument, XPath, and strpos().

XPath allows you to target every href attribute in the document directly, whichever element it belongs to.

I am electing to trim the query string from the href values before searching, so that a match inside a query string does not count. I could not rely on parse_url() (though I would have preferred it) because your href URLs are not always complete: some have no scheme.
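As a minimal sketch of that parse_url() limitation, the snippet below compares two of the href values from the question. It only illustrates the behaviour and is not part of the solution itself.

```php
<?php
// With a scheme, parse_url() can identify the host component.
$withScheme = parse_url('http://www.example.com', PHP_URL_HOST);

// Without a scheme, the whole string is treated as a path,
// so no host component is reported.
$withoutScheme = parse_url('www.example.com', PHP_URL_HOST);
$asPath = parse_url('www.example.com', PHP_URL_PATH);

var_export($withScheme);    // 'www.example.com'
echo PHP_EOL;
var_export($withoutScheme); // NULL
echo PHP_EOL;
var_export($asPath);        // 'www.example.com'
```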

Code:

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
foreach ($xpath->query("//@href") as $href) {
    $noQueryString = explode('?', $href->nodeValue, 2)[0];
    if (strpos($noQueryString, 'www.example.com') !== false) {
        $result[] = $href->nodeValue;
    }
}
var_export($result);

Output:

array (
  0 => 'http://www.example.com',
  1 => 'www.example.com',
)
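If a substring test turns out to be too loose (for instance, it would also accept a path that merely contains www.example.com), a stricter variant could compare the host itself. The following is only a sketch under stated assumptions: hostMatchesDomain() is a hypothetical helper, not part of the code above, and it assumes that any href without a "scheme://" prefix is a bare host such as www.example.com, so relative paths and mailto: links would need separate handling.

```php
<?php
// Hypothetical helper: true when the URL's host is exactly $domain
// or a subdomain of it (e.g. www.example.com for example.com).
function hostMatchesDomain(string $url, string $domain): bool
{
    // Assumption: a value with no "scheme://" prefix is a bare host,
    // so prepend a scheme to let parse_url() find the host component.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // malformed URL or no host component
    }

    $host = strtolower($host);
    $domain = strtolower($domain);

    // Exact match, or the host ends with ".$domain".
    return $host === $domain
        || substr($host, -strlen($domain) - 1) === '.' . $domain;
}

// The four href values from the question's document.
$hrefs = [
    'https://www.google.com',
    'https://facebook.com',
    'http://www.example.com',
    'www.example.com',
];

$matches = array_values(array_filter($hrefs, function (string $href): bool {
    return hostMatchesDomain($href, 'example.com');
}));

var_export($matches);
```

With the question's four links this keeps the same two values as the strpos() version, while rejecting hosts such as notexample.com.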
mickmackusa