PHP: getting the URL from HTML by its URL descriptor

Question

From within PHP, how can I get the URL of a particular link (its href) when I only know its name/description text? For example, how do I get the URL of the site map from Apple's main page by searching for the string 'Site map'?

So at the start I only know the site I want to crawl (e.g. www.apple.com) and the URL descriptor I'm interested in (e.g. 'Site map'). The correct output for the solution should be: http://www.apple.com/sitemap/

Any idea on how to solve this is highly appreciated.

score 0 · Answer 1 · answered Aug 19 '13 at 14:55

Maybe with a regular expression?

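$url = 'http://www.apple.de';
$name = 'Site Map';
$content = file_get_contents($url);
// Escape the link text so regex metacharacters in it are matched literally.
$quoted = preg_quote($name, '/');
if ($content !== false && preg_match('/<\s*a[^>]*href\s*=\s*("([^"]+)"|\'([^\']+)\')[^>]*>.*?' . $quoted . '.*?<\s*\/\s*a\s*>/i', $content, $matches)) {
    // $matches[2] (double-quoted href) or $matches[3] (single-quoted href) holds the URL.
    print_r($matches);
}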

answered Aug 19 '13 at 14:55

cyper


Regular expression parsing of HTML, XML, etc. is almost *never* a good idea. [See our favorite stack-o answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). One of the many DOM libraries is a much preferred choice. – David Aug 19 '13 at 19:51
Thanks, the regex does the job and the result of your code snippet above looks like this: `[0] => Privacy Policy [1] => "/legal/privacy/" [2] => /legal/privacy/` – user1736982 Aug 20 '13 at 08:27
I updated the regex slightly. Works like a charm: `preg_match( '/<\s*a\s*href\s*=\s*("([^"]+?)"|\'([^\']+?)\')[^>]*?>[^<]*?'.$name.'[^<]*?<\s*\/\s*a\s*>/i', $content, $matches );` – user1736982 Aug 21 '13 at 09:51

score 0 · Answer 2 · edited May 23 '17 at 11:49

Having commented negatively on another answer, I'm reluctant to propose my own, but this question seems unlikely to draw much interest from other people.

In HTML, links frequently look like the following:

<a href="http://www.apple.com/sitemap/" >http://www.apple.com/sitemap/</a>

So, what you need is the href attribute of the link's <a> tag.

There are many different ways to do this, and the choice between them is somewhat academic, which is likely why few other people have posted answers.

To parse the page, a DOM parsing library is the best choice. Here is a good answer listing many options. Study some of them.

Personally, I like to use XPath-based DOM parsing libraries, and I frequently use the DOMDocument library that comes bundled with standard PHP.

W3Schools has a pretty good XPath tutorial.
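
As a rough illustration of the DOMDocument/DOMXPath approach mentioned above, here is a minimal sketch. The function name `findHrefByText` and the sample HTML are made up for this example; it assumes the link text appears directly inside the `<a>` element and compares it case-insensitively.

```php
<?php
// Hypothetical helper (not from any library): returns the href of the first
// <a> element whose text contains $text (case-insensitive), or null if none does.
function findHrefByText(string $html, string $text): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select every anchor that has an href attribute, then compare the text in PHP.
    foreach ($xpath->query('//a[@href]') as $a) {
        // Collapse runs of whitespace so line breaks inside the link do not matter.
        $label = preg_replace('/\s+/', ' ', trim($a->textContent));
        if (stripos($label, $text) !== false) {
            return $a->getAttribute('href');
        }
    }
    return null;
}

// Sample markup standing in for a downloaded page.
$html = '<html><body>'
      . '<a href="/legal/privacy/">Privacy Policy</a> '
      . '<a href="/sitemap/">Site Map</a>'
      . '</body></html>';

echo findHrefByText($html, 'Site map'), "\n"; // prints /sitemap/
```

For a live page, the `$html` string could come from `file_get_contents($url)`. The returned href may be relative (as in the sample), in which case it still has to be resolved against the site's base URL to get something like http://www.apple.com/sitemap/.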


answered Aug 19 '13 at 20:22

David


Thanks for your answer. I agree that URLs in HTML often look like the example you show. However, in this case I know for sure that the URL descriptor is a static text (e.g. 'Site map'), so based on this could you elaborate on how to solve this problem using an XPath-based DOM parser? – user1736982 Aug 20 '13 at 08:16
HTML uses tags to identify conceptual portions of text. There is no concept of "static text" in HTML; it *will* be within a tag. I don't have time to answer in detail, but will provide some links. [Here is an introduction to Html](http://www.w3schools.com/html/html_intro.asp). [Here is a stack-o question about how to parse a page using DOMDocument and DOMXPath](http://stackoverflow.com/questions/5493525/php-dom-xpath). Best wishes! – David Aug 20 '13 at 14:13
All right, what I meant to say was that I know for certain that the web pages I try to crawl have the text string 'Site map' as the description of the URLs. Anyway, I guess I'll use the regex from the other answer, since it works, instead of trying to get a DOMXPath solution working. – user1736982 Aug 21 '13 at 09:48
