Given a URL Find The Contact Link

Question

I have a URL, and I need to find the contact link on the page it points to.

So far I have used simple_html_dom.php to loop through all the a tags; if a link contains the word "contact" or "advertis", I treat it as the contact URL. This works, but it is very slow.

What I want to do now is fetch the page with curl (no problem, and possibly multi curl in the future) and run a regex over the result to find any a href link that contains either "contact" or "advertis".

I would use preg_match_all but what would the regex be?

How about leveraging Google to do the search/crawling for you? — , Dec 11 '12 at 22:31
Dom, xpath, `//a[contains(@href,'contact')]`... done. And 'simple dom' is for people who have a gazillion spare cpu cycles... slow, slow, slow. — Wrikken, Dec 11 '12 at 22:32
@Alexander: yet another non-libxml based html/xml parser. See [this](http://stackoverflow.com/a/3577662/358679) for a comparison. — Wrikken, Dec 11 '12 at 22:36
You don't want to use regular expressions to parse HTML. They are not up to the task. http://htmlparsing.com/regexes.html explains why, and http://htmlparsing.com/php.html gives examples of how to parse HTML with the DOM module. — Andy Lester, Dec 11 '12 at 22:40
@andy, is the DOM module fast? simple_html_dom was super slow, but I guess the DOM module would be faster since it's a PHP module. I will give it a try. — gprime, Dec 11 '12 at 22:53
@gprime: I suggest that it doesn't matter how slow the DOM module is if it gives you accurate parsing and the regexes don't. Any code can be fast if you don't care if it works correctly. — Andy Lester, Dec 12 '12 at 00:53
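
For reference, the DOM/XPath approach suggested in the comments could look roughly like the sketch below. It is only an illustration under a few assumptions: `findContactLinks` is a hypothetical helper name, the page HTML is already in a string (for example from curl), and matching should ignore case. XPath 1.0 has no lower-case function, so `translate()` is used to fold the href to lower case before `contains()` is applied.

```php
<?php
// Hypothetical helper (not from the thread): returns the href values of all
// <a> elements whose href contains "contact" or "advertis", ignoring case.
function findContactLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty document
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath 1.0 is case-sensitive; translate() lower-cases the href first.
    $lower = "translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')";
    $query = "//a[contains($lower, 'contact') or contains($lower, 'advertis')]";

    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}
```

Calling `findContactLinks($page)` on the fetched HTML gives the matching hrefs in document order. Relative hrefs still need to be resolved against the original URL.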

Joseph at SwiftOtter · Accepted Answer · 2012-12-11T22:37:58.697

1

preg_match_all('/\<a href\=\"(.*?(contact|advertis)+.*?)\"\>(.+?)\<\/a>/m', $page, $matches);
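
Note that the pattern above only matches when href is the first attribute, is double-quoted, and is immediately followed by `>`. If a regex is still preferred over a DOM parser, a somewhat more tolerant variant might look like the sketch below. `findContactHrefs` is a hypothetical name, and this is an assumption-laden illustration rather than a robust HTML parser: it pulls out every quoted href on an a tag first, then filters on the keywords case-insensitively.

```php
<?php
// Hypothetical helper (not from the answer): extracts quoted href values from
// <a> tags with a regex, then keeps those mentioning "contact" or "advertis".
function findContactHrefs(string $page): array
{
    // <a, whitespace, optionally other attributes, then href = "..." or '...'.
    // Group 1 is the quote character, group 2 is the href value.
    $pattern = '/<a\s(?:[^>]*?\s)?href\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $page, $matches);

    $hrefs = [];
    foreach ($matches[2] as $href) {
        if (preg_match('/contact|advertis/i', $href) === 1) {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}
```

Unquoted href values and anchors inside HTML comments or scripts are not handled here, which is the kind of edge case the comments on the question warn about.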

edited Dec 11 '12 at 22:37

answered Dec 11 '12 at 22:30


Thanks, this works. I will use this or the DOM module. Thanks! – gprime Dec 11 '12 at 22:53
