
I want to extract the href attribute of links, specifically those that use mailto:. I want to do this not just for one link but for all links on the main page of a website.

I tried this:

<?php

$url = "https://www.omurcanozcan.com";

$html = file_get_contents( $url);

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);
$node = $xpath->query( "//a[@href='mailto:']")->item(0);


echo $node->textContent; // This will print **GET THIS TEXT**

 ?>

For instance, if the page contains this code:

<a href='mailto:omurcan@omurcanozcan.com'>omurcan@omurcanozcan.com</a>

I want to echo

<p>omurcan@omurcanozcan.com</p>

1 Answer

The main problem is that in your XPath, you are checking for

//a[@href='mailto:']

This looks for an href attribute that contains only mailto:. What you want is an href that starts with mailto:, and you can check for that using starts-with():

$node = $xpath->query( "//a[starts-with(@href,'mailto:')]")->item(0);
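The line above still takes only the first match with item(0), while the question asks for every mailto: link on the page. A minimal sketch of looping over all matches follows; the function name extractMailtoAddresses() and the sample HTML are made up for illustration, and the code assumes the address should be read from the href rather than from the link text.

```php
<?php
// Hypothetical helper (not from the original post): return the address
// of every mailto: link found in an HTML string.
function extractMailtoAddresses(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    libxml_use_internal_errors(true); // keep warnings about real-world markup quiet
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $addresses = [];
    foreach ($xpath->query("//a[starts-with(@href,'mailto:')]") as $link) {
        // Drop the "mailto:" scheme and any "?subject=..." part.
        $address = substr($link->getAttribute('href'), strlen('mailto:'));
        $address = rawurldecode(explode('?', $address, 2)[0]);
        if ($address !== '') {
            $addresses[] = $address;
        }
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($addresses));
}

// Sample input taken from the question.
$sample = "<p>Contact: <a href='mailto:omurcan@omurcanozcan.com'>omurcan@omurcanozcan.com</a></p>";
foreach (extractMailtoAddresses($sample) as $address) {
    echo '<p>' . htmlspecialchars($address) . "</p>\n";
}
// prints: <p>omurcan@omurcanozcan.com</p>
```

To run it against the real page, pass in the $html returned by file_get_contents(). Note that XPath's starts-with() is case-sensitive, so an href written as MAILTO: would not match.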

The second thing is that I don't think your page is fully loaded at the point where you get the content. A common check I do is to save the HTML once I've loaded it, so I can inspect it first:

$url = "https://www.omurcanozcan.com";

$html = file_get_contents( $url);
file_put_contents("a.html", $html);

If you then look in a.html, you can see the HTML your script is actually working with. In that content I cannot see any mailto: links.

Nigel Ren
  • Thank you, this solves my problem very well, but I can't figure out how to make this code search subdomains and other pages. – Ömürcan Özcan Jun 03 '19 at 14:42
  • Not sure what you're after for that. If it's a crawler type of thing, then https://stackoverflow.com/questions/2313107/how-do-i-make-a-simple-crawler-in-php may help. – Nigel Ren Jun 03 '19 at 14:44
  • I want to extract the URLs of all pages of a website, for instance omurcanozcan.com/deneme, omurcanozcan.com/deneme2, etc., and run this code on every link. – Ömürcan Özcan Jun 03 '19 at 14:56
  • The above code will search for all links on the loaded page which are `mailto:` links. The only thing you need to adapt it to do is to find all of the pages you want to scan. – Nigel Ren Jun 03 '19 at 14:59
  • But I don't know the URLs that belong to the website. How can I get the full URLs? – Ömürcan Özcan Jun 03 '19 at 15:06
  • The link I posted above shows you how to get the links on a page. – Nigel Ren Jun 03 '19 at 15:14
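
As a rough sketch of the step discussed in these comments (finding the pages to scan), the same DOMXPath approach can list the links on a page that stay on the same host. The function name collectSameHostLinks() is hypothetical, and the code makes simplifying assumptions: it handles only absolute, protocol-relative, and root-relative hrefs, skips other relative paths, and treats subdomains as different hosts.

```php
<?php
// Hypothetical helper (not from the original post): list the links in an
// HTML string that point to the same host as $baseUrl.
function collectSameHostLinks(string $html, string $baseUrl): array
{
    $base = parse_url($baseUrl);
    if ($html === '' || !is_array($base) || !isset($base['scheme'], $base['host'])) {
        return [];
    }
    $origin = $base['scheme'] . '://' . $base['host'];

    libxml_use_internal_errors(true); // keep warnings about real-world markup quiet
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // empty or in-page anchor
        }
        if (strpos($href, '//') === 0) {
            $href = $base['scheme'] . ':' . $href; // protocol-relative URL
        } elseif ($href[0] === '/') {
            $href = $origin . $href;               // root-relative path
        }
        $href = explode('#', $href, 2)[0];         // drop any fragment

        // Keep http(s) links whose host matches exactly; mailto: and
        // plain relative paths have no host and are skipped here.
        $parts = parse_url($href);
        if (is_array($parts)
            && isset($parts['scheme'], $parts['host'])
            && in_array(strtolower($parts['scheme']), ['http', 'https'], true)
            && strcasecmp($parts['host'], $base['host']) === 0) {
            $links[] = $href;
        }
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($links));
}
```

Each returned URL could then be fetched with file_get_contents() and scanned for mailto: links in turn. A real crawler would also need to track visited URLs and limit its depth.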