
I am collecting a list of all URLs from a web page. My problem is that the list also contains image URLs, which I don't want.

This script gives me all the links from the web page:

function getUrl($html)
{
    $regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
    preg_match_all($regex, $html, $matches);
    $urls = $matches[0];
    return $urls;
}

Here is a regex that matches image URLs in the source code:

/\bhttps?:\/\/\S+(?:png|jpg)\b/

How can I exclude image URLs from the list of extracted URLs?

UPDATE

$regex = '/(?!.*(?:\.jpe?g|\.gif|\.png)$)\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $html, $matches);
$urls = $matches[0];

Why does this regex still not exclude image URLs?

user123
  • You can't tell reliably what's an image URL and what isn't. URLs don't **HAVE** to look like an image, especially if it's something like a direct link to a script which serves an image, e.g. `click here for pic`. You can probably safely assume anything in an `<img>` tag is an image, but anything else is "good luck at that". – Marc B Jun 09 '14 at 14:08

1 Answer


You probably want to use a lookahead to check whether each URL ends with an image extension, then manually remove that URL from your list of matches. For example, to check whether a string ends with .png or .jpg, match it against:

/\.(?=(png|jpg)$)/

So loop through your list of URLs and keep only the ones that don't match that regex.

Edit: You actually don't even need a lookahead; just try to match this:

\.(png|jpg)$

and discard the URLs that match.
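
A minimal sketch of that filtering step, assuming the URLs have already been extracted with `getUrl()` from the question. The helper name `removeImageUrls` is hypothetical, and the extension list (png, jpg/jpeg, gif) is borrowed from the question's update rather than from this answer. It relies on `preg_grep` with the `PREG_GREP_INVERT` flag, which returns the entries that do not match.

```php
<?php
// Hypothetical helper: drops URLs that end in a common image extension.
function removeImageUrls(array $urls): array
{
    // Assumed extension list, taken from the question's update.
    $imagePattern = '/\.(?:png|jpe?g|gif)$/i';

    // PREG_GREP_INVERT keeps the entries that do NOT match the pattern.
    $kept = preg_grep($imagePattern, $urls, PREG_GREP_INVERT);

    // preg_grep preserves the original keys, so reindex the result.
    return array_values($kept);
}

// Example input (made-up URLs).
$urls = [
    'http://example.com/page.html',
    'http://example.com/logo.png',
    'https://example.com/photo.JPG',
    'ftp://example.com/archive.zip',
];

print_r(removeImageUrls($urls));
```

This is one extra pass over an already small array, so the added cost is usually negligible next to fetching and scanning the page. It will not catch image URLs that carry a query string (for example `logo.png?w=100`) or that have no extension at all.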

Devon Parsons
  • Did you mean checking each result against `\.(png|jpg)$` and discarding it if it matches? I guess that will increase the run time; if I can update my actual regex instead, it will be faster. This regex should exclude image links, right? `/(?!.*(?:\.jpe?g|\.gif|\.png)$)\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i` But I am still getting image links. – user123 Jun 09 '14 at 18:27
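
On why the updated regex has no effect: the lookahead `(?!.*(?:\.jpe?g|\.gif|\.png)$)` is not tied to the URL being matched. `.*` runs to the end of the current line and `$` anchors at the end of the whole subject string, so the assertion only rejects a match when the HTML itself ends in an image extension. One possible single-pattern alternative is sketched below, assuming PCRE as used by PHP's `preg_*` functions. The function name `getNonImageUrls` is hypothetical. It wraps the URL body in an atomic group so the engine cannot shorten the match, then applies a negative lookbehind to the end of the URL.

```php
<?php
// Hypothetical variant of getUrl() that skips image URLs in one pass.
function getNonImageUrls(string $html): array
{
    $regex = '/\b(?:https?|ftp|file):\/\/'
           // Atomic group: once the URL body is matched, no characters are given back.
           . '(?>[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$])'
           // Reject the match if the URL just consumed ends in an image extension.
           . '(?<!\.png|\.gif|\.jpg|\.jpeg)/i';

    preg_match_all($regex, $html, $matches);
    return $matches[0];
}

// Example input (made-up markup).
$html = '<a href="http://example.com/page.html">x</a> '
      . '<img src="http://example.com/logo.png"> '
      . 'see http://example.com/pic.jpeg.';

print_r(getNonImageUrls($html));
```

Without the atomic group, the lookbehind would simply make the engine back off one character and return a truncated URL such as `http://example.com/logo.pn`. As with any extension check, this only catches URLs that visibly end in one of the listed extensions.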