I have a URL and need to find a contact link on the page it points to.

What I did was use simple_html_dom.php to loop through all the `a` tags; if a link contains the word "contact" or "advertis", it's the contact URL. But this is very slow.

What I want to do now is fetch the page using curl (no problem, even multi curl in the future) and have a regex search the result for an `<a href>` link whose URL contains either "contact" or "advertis".

I would use preg_match_all, but what would the regex be?

Andy Lester
gprime
  • How about leveraging Google to do the search/crawling for you? –  Dec 11 '12 at 22:31
  • Dom, xpath, `//a[contains(@href,'contact')]`... done. And 'simple dom' is for people who have a gazillion spare cpu cycles... slow, slow, slow. – Wrikken Dec 11 '12 at 22:32
  • What is `simple_html_dom.php`? Reference needed – Alexander Dec 11 '12 at 22:33
  • @Alexander: yet another non-libxml based html/xml parser. See [this](http://stackoverflow.com/a/3577662/358679) for a comparison. – Wrikken Dec 11 '12 at 22:36
  • You don't want to use regular expressions to parse HTML. They are not up to the task. http://htmlparsing.com/regexes.html explains why, and http://htmlparsing.com/php.html gives examples of how to parse HTML with the DOM module. – Andy Lester Dec 11 '12 at 22:40
  • @andy, is the DOM module fast? simple_html_dom was super slow, but I guess the DOM module would be faster since it's a PHP module. I will give it a try. – gprime Dec 11 '12 at 22:53
  • @gprime: I suggest that it doesn't matter how slow the DOM module is if it gives you accurate parsing and the regexes don't. Any code can be fast if you don't care if it works correctly. – Andy Lester Dec 12 '12 at 00:53
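
Wrikken's XPath suggestion, combined with the DOM module Andy Lester points to, could look roughly like the following. This is a minimal sketch rather than anyone's posted code: the helper name `findContactLinks` and the sample markup are made up for illustration, and the `translate()` call is an addition to make the match case-insensitive, since XPath 1.0 has no lower-case function.

```php
<?php
// Hypothetical helper: returns href values containing "contact" or "advertis".
function findContactLinks(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath 1.0 has no lower-case(), so translate() folds A-Z to a-z first.
    $lower = "translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')";
    $query = "//a[contains($lower, 'contact') or contains($lower, 'advertis')]";

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Sample markup invented for illustration.
$page = '<html><body><a href="/about">About</a>'
      . '<a class="nav" href="/Contact-Us">Contact</a>'
      . '<a href="/advertise">Ads</a></body></html>';

print_r(findContactLinks($page));
```

Because the parser works on attributes rather than raw text, attribute order, quoting style, and line breaks inside the tag should not matter here.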

1 Answer

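// Case-insensitive; the href value cannot run past its closing quote, and other attributes are allowed.
preg_match_all('/<a\s[^>]*?href="([^"]*(contact|advertis)[^"]*)"[^>]*>(.*?)<\/a>/is', $page, $matches);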
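
To read the result: with the default `PREG_PATTERN_ORDER` flag, `$matches[1]` holds the captured href values and `$matches[3]` the link text. Here is a small sketch of that, using a made-up `$page` string and a somewhat stricter variant of the pattern above (it keeps the href value inside its double quotes and tolerates other attributes). The stricter pattern is an assumption on my part, and it still only handles double-quoted `href` attributes.

```php
<?php
// Sample markup invented for illustration.
$page = '<a href="/about">About</a> '
      . '<a class="nav" href="/contact-us">Contact</a> '
      . '<a href="/Advertise" rel="nofollow">Advertise</a>';

// i = case-insensitive, s = let . match newlines inside the link text.
$pattern = '/<a\s[^>]*?href="([^"]*(contact|advertis)[^"]*)"[^>]*>(.*?)<\/a>/is';

preg_match_all($pattern, $page, $matches);

// Group 1 = href value, group 2 = matched keyword, group 3 = link text.
$links = array_combine($matches[1], $matches[3]);
print_r($links);
```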
Joseph at SwiftOtter