preg_match_all find links, remove same results?

Question

I have a problem with matching results. This is my script; I can't work out how to add the link from the scraped content and avoid duplicate results. I only need links that begin with http://www.autogidas.lt/ ....

 <?
 $id= $_GET['id'];
 $user= $_GET['user'];
 $login=$_COOKIE['login'];

 $query = mysql_query("SELECT pavadinimas,nuoroda,kuras,data,data_new from autogidas where vartotojas='$user' and id='$id'");
 $rezultatas=mysql_fetch_row($query);

 $url = "$rezultatas[1]";

 $info = file_get_contents($url); 

 function scrape_between($data, $start, $end){
 $data = stristr($data, $start); 
 $data = substr($data, strlen($start));
 $stop = stripos($data, $end);
 $data = substr($data, 0, $stop);
 return str_replace('  ', ' ', $data);
 }
 $contents = scrape_between($info, "<table border=\"0\" cellspacing=\"0\">", "</table>");

   preg_match_all('/<span class="ttitle2".*?>(.*?)<\/span>/',$contents,$pavadinimas); 

   preg_match_all('/<span class="ttitle3".*?>(.*?)<\/span>/',$contents,$miestas); 

   preg_match_all('/<span class="ttitle1".*?>(.*?)<\/span>/',$contents,$metai_kaina); 

   foreach($metai_kaina[0] as $key=>$metai_kaina_val){ 
   if($key%2==0)
   $metai[] = strip_tags($metai_kaina_val);
   else  
   $kaina[] = strip_tags($metai_kaina_val);  
   }

   preg_match_all('/<img .*?(?=src)src=\"([^\"]+)\"/si', $contents, $img_link);
   preg_match_all('/<a href="http:\/\/www.autogidas.lt(.*?)"/s', $contents, $matches);

   for($i=0; $i<count($pavadinimas[0]); $i++){
    echo '<tr>
      <td><a href='HERE I NEED LINKS'><img src="'.$img_link[1][$i].'"></a></td>
      <td>'.$pavadinimas[0][$i].'</td>
      <td>'.$miestas[0][$i].'</td>
      <td>'.$metai[$i].'</td>
      <td><center>'.$kaina[$i].'</center></td>
    </tr>';
    }

   echo "</table>";
   ?>

I have tried some suggestions, but I don't know how to update the script. This is the last thing I need and I can't find how to do it. I'm not a professional, I am only teaching myself PHP for fun. Thanks for the help! Sorry for my bad English.

Then add your `http:\/\/www.adress.com` prefix to the capture group. — mario, Nov 08 '15 at 15:53

trincot · Accepted Answer · 2015-11-09T19:28:20.140

You could use this code:

// Regex that only matches anchors whose href starts with http://www.adress.com/
$regexp = "<a\s[^>]*href=(\"??)(http\:\/\/www\.adress\.com\/[^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $svetaines_turinys, $matches, PREG_SET_ORDER)) {
    // collect results in array
    $arr = [];
    foreach($matches as $match) {
        $arr[] = $match[2];
    }
    // remove duplicates from it
    $arr = array_unique($arr);
    // send to client
    foreach($arr as $match) {
        echo "$match <BR/>";
    }
}
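
One detail worth knowing about the code above: array_unique() keeps the original keys, so the result can have gaps in its numeric indexes. That does not matter for foreach, but it would matter if the list were later read by position, for example with $i in a for loop. A minimal sketch, assuming made-up URLs (the $links values are hypothetical and not taken from the scraped page):

```php
<?php
// Hypothetical sample data: the second entry repeats the first.
$links = [
    'http://www.adress.com/a',
    'http://www.adress.com/a',
    'http://www.adress.com/b',
];

// array_unique() keeps the first occurrence of each value and preserves its key,
// so the surviving keys here are 0 and 2.
$unique = array_unique($links);

// array_values() renumbers the keys from 0, so $reindexed[$i] has no gaps.
$reindexed = array_values($unique);

print_r($reindexed);
```

If the de-duplicated list is only ever walked with foreach, the array_values() step can be left out.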

EDIT after the changes made to the original question:

You want unique hyperlinks because the same hyperlink is used twice on the pages you are scraping. The two occurrences are not identical, though: only one of them is followed by an img tag. So you could change the regular expression that fills $matches as follows:

preg_match_all('/<a href="(http:\/\/www\.autogidas\.lt[^"]*)"\s*>\s*<img/s',
    $contents, $matches);

Note that in the above regular expression I have also moved the opening parenthesis of the capture group so that it captures the whole URL, which is what you need in the code below.
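
To illustrate what moving the group changes, here is a small sketch that runs both variants against one anchor. The $html string and the /skelbimas/123.html path are hypothetical sample data, not the real page markup:

```php
<?php
// Hypothetical one-line sample of the scraped markup.
$html = '<a href="http://www.autogidas.lt/skelbimas/123.html"><img src="a.jpg"></a>';

// Group placed after the host name: only the path is captured.
preg_match('/<a href="http:\/\/www\.autogidas\.lt([^"]*)"/', $html, $pathOnly);

// Group opened before "http": the whole URL is captured.
preg_match('/<a href="(http:\/\/www\.autogidas\.lt[^"]*)"/', $html, $fullUrl);

echo $pathOnly[1], "\n"; // /skelbimas/123.html
echo $fullUrl[1], "\n";  // http://www.autogidas.lt/skelbimas/123.html
```

With the first form you would have to prepend the host yourself before using the value in an href.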

Then in your loop, you can output the hyperlinks with this piece inside your quoted string:

    <a href="'.$matches[1][$i].'">
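
Put together, the loop could look roughly like the sketch below. It is not your full script: $contents is a hypothetical stand-in for the scraped table, the image regex is simplified, and only the link and image columns are built. The htmlspecialchars() calls are an addition of mine to escape the scraped values before they go into attributes.

```php
<?php
// Hypothetical stand-in for $contents: each listing links to the same URL twice,
// once around the thumbnail and once around the title.
$contents = '
<a href="http://www.autogidas.lt/skelbimas/1.html"><img src="1.jpg"></a>
<a href="http://www.autogidas.lt/skelbimas/1.html">Audi A4</a>
<a href="http://www.autogidas.lt/skelbimas/2.html">
  <img src="2.jpg"></a>
<a href="http://www.autogidas.lt/skelbimas/2.html">BMW 320</a>';

// Only anchors directly followed by an <img> tag match, so each URL is found once.
preg_match_all('/<a href="(http:\/\/www\.autogidas\.lt[^"]*)"\s*>\s*<img/s', $contents, $matches);
// Simplified image pattern for this sample: the src attribute of every <img> tag.
preg_match_all('/<img [^>]*src="([^"]+)"/i', $contents, $img_link);

$rows = '';
for ($i = 0; $i < count($matches[1]); $i++) {
    // Escape the scraped values before placing them inside HTML attributes.
    $href = htmlspecialchars($matches[1][$i]);
    $src  = htmlspecialchars($img_link[1][$i]);
    $rows .= '<tr><td><a href="' . $href . '"><img src="' . $src . '"></a></td></tr>' . "\n";
}
echo $rows;
```

The loop runs over $matches[1] here, which assumes the page has exactly one image link per listing, in the same order as the other columns.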

NB: You should start your code with <?php, not just <?. The short form only works when the short_open_tag setting is enabled.

I have added in my answer the things you need to do with your code to avoid repeated hyperlinks and how to output them. — trincot, Nov 09 '15 at 19:30
