Excluding URLs that contain "http" twice

Question

I am following this question to extract URLs that contain specific words from a webpage:

regex to print url from any webpage with specific word in url

However, some URLs, such as Pinterest and Facebook referral URLs, contain the words I am interested in, but I don't want them because they are not direct URLs. I have noticed that these URLs always contain "http" at least twice.

Something like this:

http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2Fpicture-gallery%2Fsensual-in-saree-gallery%2Fspecials%2F3774%2F7%2Findex.htm&media=http%3A%2F%2Fmedia.glamsham.com%2Fdownload%2Fpicturegallery%2Ffeatured%2Fbollywood-beauties-saree%2F722-sensual-in-saree.jpg&guid=gNh5ehWodCZW-0&description=Rani%20Mukerji%20in%20saree%20at%20Sensual%20in%20saree%20picture%20gallery%20picture%20%23%207%20%3A%20glamsham.com

So I want to exclude URLs that contain "http" at least twice.

http://stackoverflow.com/questions/1188129/replace-urls-in-text-with-html-links/16509122#16509122 — amarjit singh, Dec 09 '13 at 16:52
`preg_match('/(http.*?)http/', 'https://foo.bar.baz/q=http://blah.com', $matches);` -- ungreedy matching of any two `http` with anything in between. — Damon, Dec 09 '13 at 17:01

score 0 · Accepted Answer · answered Dec 09 '13 at 16:43

You can try something like this to avoid these URIs:

$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $href = $node->getAttribute('href');
    // Print the href only if no second http/https follows the leading scheme
    if (!preg_match('~^https?://.+?https?\b~i', $href)) {
        echo "$href\n";
    }
}

preg_match('~^https?://.+?https?\b~i', $href) matches the URIs to be excluded: those that start with http:// or https:// and contain another http or https later on.
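To see how this kind of check behaves, here is a minimal sketch that runs it against one direct URL and one referral URL. The helper name `hasEmbeddedUrl` is hypothetical, the referral URL is a shortened copy of the Pinterest example from the question, and the pattern is written so that the leading scheme may be either http or https.

```php
<?php
// Hypothetical helper: true when $href has a second http/https after its leading scheme.
function hasEmbeddedUrl(string $href): bool
{
    return preg_match('~^https?://.+?https?\b~i', $href) === 1;
}

// Sample inputs (assumptions for illustration, based on the URL in the question).
$direct   = 'http://www.glamsham.com/picture-gallery/sensual-in-saree-gallery/specials/3774/7/index.htm';
$referral = 'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2Fpicture-gallery';

var_dump(hasEmbeddedUrl($direct));   // bool(false)
var_dump(hasEmbeddedUrl($referral)); // bool(true)
```

The trailing `\b` means a path such as `/httpd-docs` does not count as a second "http", because the "p" is followed by another word character. A path such as `/http-guide` would still be flagged, so this is a heuristic and not a full URL parser.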

anubhava


http://stackoverflow.com/questions/1188129/replace-urls-in-text-with-html-links/16509122#16509122 – amarjit singh Dec 09 '13 at 16:53

score 0 · Answer 2 · answered Dec 09 '13 at 16:44

I'd probably check each URL as you loop through them and skip the ones that contain "http" more than once, for example:

$request_url = 'YOUR URL';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page body instead of printing it
$result = curl_exec($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$doc->loadHTML($result); // loads your html
$xpath = new DOMXPath($doc);
$needle = 'blog';

// select every link whose href contains the needle
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
$validUrls = array();
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $curUrl = $node->getAttribute('href');
    // keep the URL only if "http" occurs exactly once
    if (substr_count($curUrl, 'http') === 1) {
        $validUrls[] = $curUrl;
    }
}

var_dump($validUrls); // all urls with only one "http"
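One detail of the `=== 1` test is worth seeing on its own: it also drops relative links, which contain no "http" at all. The sketch below applies the same `substr_count` check to a small hard-coded list, so it runs without cURL or a DOM. The sample hrefs and the variable names `$exactlyOne` and `$atMostOne` are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample hrefs; in the answer above they come from the XPath query.
$hrefs = [
    'http://www.glamsham.com/blog/index.htm',
    'http://www.pinterest.com/pin/create/button/?url=http%3A%2F%2Fwww.glamsham.com%2Fblog',
    '/blog/archive.htm',
];

// Keep hrefs in which "http" occurs exactly once (the same test as above).
$exactlyOne = array_values(array_filter($hrefs, function ($href) {
    return substr_count($href, 'http') === 1;
}));

// Looser variant: also keep relative links, which contain no "http" at all.
$atMostOne = array_values(array_filter($hrefs, function ($href) {
    return substr_count($href, 'http') <= 1;
}));

print_r($exactlyOne);
print_r($atMostOne);
```

Which variant fits depends on whether relative links on the page should count as direct URLs. Note also that `substr_count` is case-sensitive, so an href written as `HTTP://...` would not be counted.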
