0

I have this piece of text, and I want to extract links from it. Some links will be inside tags and some will just appear in plain format. But I also have images, and I don't want their links.

How would I extract links from this piece of text while ignoring image links? So basically, a tagged link and a plain google.com should both be extracted.

string(441) "<p class="fr-tag">Please visit&nbsp;https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this <a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg" rel="nofollow">link</a>&nbsp;should be filtered and this&nbsp;http://d.pr/i/1i2Xu&nbsp;<img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>"

I have tried the following, but it's incomplete:

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Collect the href of every <a> tag (plain-text links are not covered yet).
    $hrefs = [];
    $tags = $dom->getElementsByTagName('a');
    foreach ($tags as $tag) {
        $hrefs[] = $tag->getAttribute('href');
    }

Ali Gajani

3 Answers

1

I would try something like this.

Find and remove images tags:

$content = preg_replace("/<img[^>]+\>/i", "(image) ", $content); 

Find and collect URLs.

preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $content, $match);

Output Urls:

print_r($match);
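
The steps above can be combined into one short script. This is only a sketch under a few assumptions: the sample string is a shortened stand-in for the HTML in the question, the `&nbsp;` replacement and the `array_unique()` call are additions that are not part of the steps above, and `$links` is a name chosen here for illustration.

```php
<?php
// Shortened stand-in for the question's HTML (URLs simplified for readability).
$content = '<p>Please visit&nbsp;https://www.google.co.uk/?gfe_rd=cr and this '
    . '<a href="https://www.google.co.uk/?gfe_rd=cr" rel="nofollow">link</a>'
    . '&nbsp;and this&nbsp;http://d.pr/i/1i2Xu&nbsp;'
    . '<img alt="Image title" src="https://example.com/shot.png" width="300"></p>';

// 1. Replace <img> tags first, so their src URLs are never scanned.
$content = preg_replace("/<img[^>]+\>/i", "(image) ", $content);

// 2. Assumption: turn &nbsp; into a plain space so the entity
//    does not get glued onto the end of a URL.
$content = str_replace('&nbsp;', ' ', $content);

// 3. Collect URLs with the same pattern as above.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $content, $match);

// 4. Assumption: the same URL appearing as plain text and inside
//    an href should be counted once.
$links = array_values(array_unique($match[0]));

print_r($links);
```

With this input, `$links` should hold the Google URL and the d.pr URL once each, and the image URL should not appear.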

Good luck!

luckyape
  • Cool, thanks, but how would you incorporate this link . It says 3 on my html posted in the question, but it should only be 2, since there are 2 links. Your code won't ignore the image. – Ali Gajani Jun 19 '15 at 13:37
1

Using just that one string to test, the following works for me:

$str =  '<p class="fr-tag">Please visit&nbsp;https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this <a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg" rel="nofollow">link</a>&nbsp;should be filtered and this&nbsp;http://d.pr/i/1i2Xu&nbsp;<img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';

preg_match('~a href="(.*?)"~', $str, $strArr);

Using a href="..." in the preg_match() statement fills an array, $strArr, with two values: the full match at index 0 and the captured Google link at index 1.

Array
(
    [0] => a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg"
    [1] => https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg
)
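
Note that `preg_match()` stops at the first match, so the array above describes a single link. A minimal sketch of collecting every anchor, assuming `preg_match_all()` is acceptable here; the sample string, the slightly stricter pattern, and the names `$found` and `$hrefs` are illustrative and not from the answer:

```php
<?php
// Hypothetical sample with two anchors and one image.
$str = '<p>Visit <a href="https://www.google.co.uk/">Google</a> or '
    . '<a href="http://d.pr/i/1i2Xu">this</a> '
    . '<img src="https://example.com/shot.png" alt="x"></p>';

// preg_match_all() keeps scanning after the first match.
// The pattern only starts at "<a ", so <img src="..."> cannot match.
preg_match_all('~<a\s[^>]*?href="(.*?)"~i', $str, $found);

// $found[0] holds the full matches, $found[1] the captured URLs.
$hrefs = $found[1];

print_r($hrefs);
```

This still finds only links inside `<a>` tags; plain-text URLs need a separate pass.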
the_pete
  • It's an array that stores all values matching the string. http://php.net/manual/en/function.preg-match.php – the_pete Jun 19 '15 at 13:45
  • Nice, that seems to work, but do you think it will have any downsides? I am trying to build a link filter for a forum. And what part of the regex actually ignores the image tags? I want to know :) Also, your filter didn't extract this: http://d.pr/i/1i2Xu. I want plain links to be extracted too. That's the challenge. – Ali Gajani Jun 19 '15 at 13:54
  • For the most part, this should be an okay place to start. I didn't go into a lot of processing on the data within the array because each link you're processing from the full source could be different and require its own unique set of regex or if statements. To ignore the `img` link, the `a href="(.*?)"` pattern in `preg_match()` looks for any text surrounded by `a href="..."`, and as an `img` tag doesn't follow this format, we are able to skip it. – the_pete Jun 19 '15 at 14:33
-1

I played around with this a lot more and have an answer that may better suit what you are trying to do, with a bit of "future proofing":

$str =  '<p class="fr-tag">Please visit&nbsp;www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this <a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg" rel="nofollow">link</a>&nbsp;should be filtered and this&nbsp;http://d.pr/i/1i2Xu&nbsp;<img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
$str = str_replace('&nbsp;',' ',$str);
$strArr = explode(' ',$str);
$len = count($strArr);
$matches = []; // initialize so print_r() below never sees an undefined variable

for($i = 0; $i < $len; $i++){
    if(stristr($strArr[$i],'http') || stristr($strArr[$i],"www")){
        $matches[] = $strArr[$i];
    }
}

echo "<pre>";
print_r($matches);
echo "</pre>";

I went back and analyzed your string and noticed that if you translate the &nbsp; to spaces you can then explode the string into an array, step through that and if any elements contain http or www then add them to the $matches array to be processed later. The output is pretty clean and easy to work with and you also get rid of most of the html markup this way.

Something to note is that this probably isn't the best way to do this. I haven't tested with any other strings but the one you offered so there's optimization that can be done.
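
One possible refinement is sketched below, using the DOMDocument class from the question instead of splitting on spaces. It is an illustration under assumptions, not a tested forum filter: the sample string is shortened, the URL pattern is deliberately simple, and `$links` and `$text` are names made up here. Anchor hrefs come from the `<a>` elements, plain links come from text nodes only, so an `<img src="...">` attribute is never examined.

```php
<?php
// Shortened stand-in for the question's HTML.
$html = '<p>Please visit&nbsp;https://www.google.co.uk/ and this '
    . '<a href="https://www.google.co.uk/" rel="nofollow">link</a>'
    . '&nbsp;and this&nbsp;http://d.pr/i/1i2Xu&nbsp;'
    . '<img alt="Image title" src="https://example.com/shot.png"></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];

// Links that sit inside <a> tags.
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

// Plain links: look at text nodes only.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()') as $node) {
    // &nbsp; arrives as a UTF-8 non-breaking space; make it a normal space.
    $text = str_replace("\xC2\xA0", ' ', $node->nodeValue);
    if (preg_match_all('#\bhttps?://[^\s<>"]+#', $text, $m)) {
        $links = array_merge($links, $m[0]);
    }
}

// Keep each distinct URL once.
$links = array_values(array_unique($links));

print_r($links);
```

Plain `www.` links without a scheme are not matched by this pattern and would need an extra alternative.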

the_pete