I am using the code below to extract URLs from a webpage, and it works fine, but I want to filter the results. It currently displays every URL on the page, but I want only the URLs that contain the word "super".

    $regex = '|<a.*?href="(.*?)"|';
    preg_match_all($regex, $result, $parts);
    $links = $parts[1];
    foreach ($links as $link) {
        echo $link . "<br>";
    }

So it should echo only the URLs in which the word "super" is present. For example, it should ignore this URL:

       http://xyz.com/abc.html  

but it should echo

        http://abc.superpower.com/hddll.html

because it contains the required word "super".

chetna123

1 Answer

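Require the word inside the captured href value, and use `[^"]*` instead of `.*?` so the capture cannot run past the closing quote into a later link:

$regex = '|<a\s[^>]*?href="([^"]*super[^"]*)"|i';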
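
A quick way to check a pattern like this is to run it against a small piece of markup. The sketch below is only an illustration: the sample HTML is made up, and the pattern restricts the capture to `[^"]*` so that a link without "super" cannot be swallowed into a later match.

```php
<?php
// Hypothetical sample markup standing in for the fetched page in $result.
$result = '<a href="http://xyz.com/abc.html">one</a>'
        . '<a class="nav" href="http://abc.superpower.com/hddll.html">two</a>';

// Capture only href values that contain "super".
// [^>]* stays inside the opening tag; [^"]* stays inside the quoted value.
$regex = '|<a\s[^>]*?href="([^"]*super[^"]*)"|i';
preg_match_all($regex, $result, $parts);
$links = $parts[1];

foreach ($links as $link) {
    echo $link . "\n";
}
```

With this sample only the superpower.com link is printed. It assumes double-quoted href attributes, which is one reason a DOM parser is the safer choice.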

However, to parse and scrape HTML it is better to use PHP's DOM parser.

Update: Here is code using DOM parser:

$request_url = 'http://1900girls.blogspot.in/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$result = curl_exec($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$doc->loadHTML($result); // loads your html
$xpath = new DOMXPath($doc);
$needle = 'blog';

// select every <a> element whose href contains the needle
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
foreach ($nodelist as $node) {
    echo $node->getAttribute('href') . "\n";
}
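
For trying this out without a network request, the same idea can be wrapped in a function that takes an HTML string. This is only a sketch: `filterLinks` is a hypothetical name, the sample markup is made up, and the substring check is done in PHP with `strpos` rather than inside the XPath expression, which avoids quoting problems when a needle itself contains a quote.

```php
<?php
// Hypothetical helper: return the href values that contain any of the needles.
function filterLinks(string $html, array $needles): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate malformed real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    // Select every <a> that has an href, then filter on the PHP side.
    foreach ((new DOMXPath($doc))->query('//a[@href]') as $node) {
        $href = $node->getAttribute('href');
        foreach ($needles as $needle) {
            if (strpos($href, $needle) !== false) {
                $links[] = $href;
                break; // one matching needle is enough
            }
        }
    }
    return $links;
}

// Made-up sample page.
$html = '<html><body><a href="http://xyz.com/abc.html">a</a>'
      . '<a href="http://abc.superpower.com/hddll.html">b</a>'
      . '<a href="/blog/post.html">c</a></body></html>';

print_r(filterLinks($html, ['super', 'blog']));
```

Like `contains()` in XPath, `strpos` is case-sensitive; swap in `stripos` if "Super" should match as well.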
anubhava
  • Post a sample of `$result` for which it doesn't work. – anubhava Oct 27 '13 at 05:40
  • Try filtering the links that contain the word "blog", using the URL 1900girls.blogspot.in. – chetna123 Oct 27 '13 at 05:46
  • complete code $request_url ='http://1900girls.blogspot.in/'; $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, $request_url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); $result = curl_exec($ch); $regex = '|"; } ?> – chetna123 Oct 27 '13 at 05:47
  • You need to edit your question and provide that info in your question itself to get better answer. – anubhava Oct 27 '13 at 05:48
  • I think it's very clear: a regex to print the URLs from any webpage that contain a specific word. – chetna123 Oct 27 '13 at 05:52
  • I am mobile at present, will get back to you. – anubhava Oct 27 '13 at 07:03
  • Thanks anubhava, this is exactly what I wanted. Thank you very much. – chetna123 Oct 27 '13 at 10:43
  • Is there any way to extract all the links in your answer as absolute URLs? At present it works fine, but it returns relative as well as absolute URLs. The absolute ones are no problem, but I have to convert the relative ones. – chetna123 Nov 17 '13 at 08:51
  • Links are being extracted from the HTML source so are as per the original source. However you can prefix them with base URI in your code. – anubhava Nov 17 '13 at 09:10
  • Thanks. First I have to check whether the href contains "http" or "www". If it does not, I will prefix the base URL. – chetna123 Nov 17 '13 at 09:53
  • Yes that is exactly what I meant. – anubhava Nov 17 '13 at 10:40
  • I want to filter the results by two words. In your answer you filtered by the word "blog" in the line $needle = 'blog'; but I want to check for one more word. How can I add a second word? – chetna123 Nov 20 '13 at 16:57
  • Use it like this: `$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "') or contains(@href, '" . $needle1 . "')]");` – anubhava Nov 20 '13 at 17:36
  • Hi, thanks for your answer. This time I want to exclude some links rather than include them. Your contains() check matches my filter words, but I want to exclude Facebook. Facebook share URLs contain my filter words, so I want to exclude any URL that contains "http" twice. Any idea how I can do that? – chetna123 Dec 09 '13 at 16:17
  • @chetna123: Request you to please create a new question linking this one. It is not a SO philosophy to change nature of question through comments and SO admin folks won't like it. – anubhava Dec 09 '13 at 16:20
  • I am creating a new question under a new account name, as this account is blocked from asking. – chetna123 Dec 09 '13 at 16:27
  • Alright sure, just leave your question link here so that I can make an attempt on it. – anubhava Dec 09 '13 at 16:28
  • My new question: http://stackoverflow.com/questions/20475650/excluding-double-http-from-url – chetna123 Dec 09 '13 at 16:32
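
The base-URL prefixing discussed in the comments could look roughly like the sketch below. `toAbsolute` is a hypothetical helper name, and the rules are assumptions taken from the comments (leave http/https links alone, treat a leading "www." as a host, prefix everything else with the base). It does not resolve `../` segments or paths relative to the current page's directory, and it does not special-case `mailto:` or `#fragment` links.

```php
<?php
// Hypothetical helper: make an href absolute by prefixing a base URL when needed.
function toAbsolute(string $href, string $base): string
{
    // Already absolute (http:// or https://): leave untouched.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    // Protocol-relative (//host/path): borrow the scheme from the base URL.
    if (strpos($href, '//') === 0) {
        return parse_url($base, PHP_URL_SCHEME) . ':' . $href;
    }
    // Bare "www." host, as mentioned in the comments: assume http.
    if (stripos($href, 'www.') === 0) {
        return 'http://' . $href;
    }
    // Otherwise treat it as a path and join with exactly one slash.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

$base = 'http://1900girls.blogspot.in/';
echo toAbsolute('/2013/10/post.html', $base), "\n";
echo toAbsolute('http://abc.superpower.com/hddll.html', $base), "\n";
```

Each href returned by the XPath query can be passed through a helper like this before it is echoed.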