
I am following this link to extract URLs that contain specific words from a webpage:

regex to print url from any webpage with specific word in url

However, a few URLs, such as Pinterest and Facebook referral URLs, contain the words I am interested in, but I don't want to use them because they are not direct URLs. I want to exclude them, and I have observed that these URLs contain at least two occurrences of http.

Something like this:

http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2Fpicture-gallery%2Fsensual-in-saree-gallery%2Fspecials%2F3774%2F7%2Findex.htm&media=http%3A%2F%2Fmedia.glamsham.com%2Fdownload%2Fpicturegallery%2Ffeatured%2Fbollywood-beauties-saree%2F722-sensual-in-saree.jpg&guid=gNh5ehWodCZW-0&description=Rani%20Mukerji%20in%20saree%20at%20Sensual%20in%20saree%20picture%20gallery%20picture%20%23%207%20%3A%20glamsham.com

So I want to exclude URLs that contain at least two occurrences of http.

Priya
  • http://stackoverflow.com/questions/1188129/replace-urls-in-text-with-html-links/16509122#16509122 – amarjit singh Dec 09 '13 at 16:52
  • `preg_match('/(http.*?)http/', 'https://foo.bar.baz/q=http://blah.com', $matches);` -- ungreedy matching of any two `http` with anything in between. – Damon Dec 09 '13 at 17:01
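
To see what the pattern from the comment above captures, here is a minimal sketch using the same subject string. The variable names and the `example.com` URL are assumptions made only for illustration.

```php
<?php
// Lazy match: the shortest stretch that starts at one "http" and ends at the next one.
$subject = 'https://foo.bar.baz/q=http://blah.com';
$found = preg_match('/(http.*?)http/', $subject, $matches);

// $found is 1 when the string holds at least two occurrences of "http".
// $matches[1] holds everything from the first "http" up to (not including) the second.
var_dump($found, $matches[1]);

// A URL with a single "http" does not match, so preg_match returns 0.
$single = preg_match('/(http.*?)http/', 'http://example.com/blog/post');
var_dump($single);
```

The return value alone is enough to decide whether a URL should be skipped; the captured group is only useful if the part before the embedded URL is needed.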

2 Answers


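You can try something like this to avoid these URIs:

// assumes $xpath (a DOMXPath) and $needle are already defined
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $href = $node->getAttribute('href');
    // print only hrefs that have no second http/https after the scheme
    if (!preg_match('~^https?://.+?https?\b~i', $href)) {
        echo "$href\n";
    }
}

preg_match('~^https?://.+?https?\b~i', $href) matches the URIs to be excluded, i.e. those that contain a second http or https after the leading scheme.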
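
A minimal, self-contained sketch of that check, assuming the leading scheme may be either http:// or https://. The helper name `isReferralUrl` is hypothetical, and the referral URL is a shortened form of the Pinterest example from the question.

```php
<?php
// Hypothetical helper: true when a second http/https follows the URL's own scheme.
function isReferralUrl(string $href): bool
{
    return preg_match('~^https?://.+?https?\b~i', $href) === 1;
}

// Direct URL: "http" appears only as the scheme ("index.htm" does not count).
$direct = 'http://www.glamsham.com/picture-gallery/sensual-in-saree-gallery/specials/3774/7/index.htm';

// Referral URL: the target is embedded percent-encoded, so "http%3A" is still visible.
$referral = 'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2Fpicture-gallery';

var_dump(isReferralUrl($direct));
var_dump(isReferralUrl($referral));
```

Because the embedded URL is only percent-encoded, the literal text http survives in it, which is what makes this kind of test workable without decoding the query string first.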

anubhava

I'd probably check as you loop through them and remove the ones with double http's, for example:

$request_url = 'YOUR URL';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$result = curl_exec($ch);
if ($result === false) {
    die("Could not fetch $request_url\n");
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$doc->loadHTML($result); // loads your html
$xpath = new DOMXPath($doc);
$needle = 'blog';

// select every link whose href contains the needle
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
$validUrls = array();
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $curUrl = $node->getAttribute('href');
    // keep the URL only if "http" occurs exactly once (relative hrefs are skipped too)
    if (substr_count($curUrl, 'http') === 1) {
        $validUrls[] = $curUrl;
    }
}

var_dump($validUrls); // all urls with only one "http"
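
The counting step can also be tried on its own, without fetching a page. This is only a sketch: the helper name `keepDirectUrls` and the sample URLs are assumptions, and lowercasing the URL before counting is an added assumption so that an uppercase HTTP is counted as well.

```php
<?php
// Hypothetical helper: keep only URLs in which "http" occurs exactly once.
function keepDirectUrls(array $urls): array
{
    $direct = array_filter($urls, function ($url) {
        // lowercase first so that "HTTP" is counted too
        return substr_count(strtolower($url), 'http') === 1;
    });
    return array_values($direct); // reindex from 0
}

$sample = [
    'http://example.com/blog/post-1',
    'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fexample.com%2Fblog%2Fpost-1',
    '/blog/relative-link',
];

var_dump(keepDirectUrls($sample));
```

Note that a relative link such as /blog/relative-link contains no http at all, so the "exactly one" test drops it along with the referral URL. If relative links should be kept, comparing against `<= 1` instead would be one option.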
Nick