
I need to extract absolute URLs from a page's HTML source. Specifically, I am extracting URLs from the following:

>img tag SRC
>Script tag SRC (JS)
>CSS links

I'm using a separate function for each of the three. The problem is that I sometimes get relative URLs, which are of no use as they stand because I have to process them further. Please review the following three functions and suggest improvements and corrections, in particular how I can convert the URLs to absolute ones (after checking that they are not already absolute, of course).

Thank you!

Function for extracting Image SRC.

function get_images() {
    // Capture the src value of <img> tags that ends in a common image extension.
    // Note: tags inside HTML comments are matched as well.
    $regex = '/<img\s[^>]*src=["\']([^"\']+\.(?:jpg|jpeg|png|gif))/i';
    preg_match_all($regex, $this->source_code, $matches);
    // $matches[1] holds the captured URLs, indexed from 0.
    return $matches[1];
}

Function for extracting JS links

function get_scripts() {
    // Capture the src value of <script> tags that ends in .js.
    $regex = '/<script\s[^>]*src=["\']([^"\']+\.js)/i';
    preg_match_all($regex, $this->source_code, $matches);
    // $matches[1] holds the captured URLs, indexed from 0.
    return $matches[1];
}

Function for extracting CSS stylesheet links

function get_css() {
    // Capture the href value of <link> tags that ends in .css.
    $regex = '/<link\s[^>]*href=["\']([^"\']+\.css)/i';
    preg_match_all($regex, $this->source_code, $matches);
    // $matches[1] holds the captured URLs, indexed from 0.
    return $matches[1];
}

Output I get when I run the image function on Google.com's source:

Array ( [0] => /images/icons/product/chrome-48.png [1] => http://www.google.com/images/hpp/pyramids-35.png ) 

The first link starts with /images/... and cannot be used as it stands. This is the problem I'm trying to fix for all three types of sources.
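
One possible way to handle this is to resolve each extracted URL against the URL of the page that was fetched. The sketch below is a simplified take on RFC 3986 reference resolution, not a complete implementation. The function name `to_absolute_url` and the `$base` parameter (the page URL passed to cURL) are hypothetical and do not come from the code above. It assumes `$base` is a well-formed absolute http or https URL, and it drops the base query string when resolving a query-only or fragment-only reference.

```php
<?php
// Hypothetical helper: resolve $url against $base (the URL of the fetched page).
function to_absolute_url($url, $base)
{
    $url = trim($url);

    // Empty, or already absolute (starts with a scheme such as http: or https:).
    if ($url === '' || preg_match('#^[a-z][a-z0-9+.\-]*:#i', $url)) {
        return $url;
    }

    $parts    = parse_url($base);
    $scheme   = $parts['scheme'];
    $root     = $scheme . '://' . $parts['host']
              . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = isset($parts['path']) ? $parts['path'] : '/';

    // Protocol-relative URL: //cdn.example.com/a.js
    if (substr($url, 0, 2) === '//') {
        return $scheme . ':' . $url;
    }

    // Split off the query string / fragment so only the path is normalised.
    $cut    = strcspn($url, '?#');
    $path   = (string) substr($url, 0, $cut);
    $suffix = (string) substr($url, $cut);

    // Query-only or fragment-only reference: reuse the base path.
    if ($path === '') {
        return $root . $basePath . $suffix;
    }

    // Path-relative URL: prepend the directory of the base path.
    if ($path[0] !== '/') {
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', ltrim($path, '/')) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }

    return $root . '/' . implode('/', $out) . $suffix;
}

// Example with the first link from the output above:
echo to_absolute_url('/images/icons/product/chrome-48.png', 'http://www.google.com/');
```

Each of the three extraction functions could pass its results through a helper like this, for example with `array_map`, so that links which are already absolute are returned unchanged.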

i333
  • Looks like it's on your server; if so, you should already have the base URL for it. Either way, [you shouldn't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Try using an XML or DOM parser instead, such as phpQuery. – casraf Oct 01 '14 at 14:51
  • It's not on my server. I'm using PHP cURL to get the source code and then extracting the links with regex. Can you tell me why I shouldn't use regex in PHP and should use a DOM parser instead? (My task scope was to use PHP as much as possible, so I'm trying to stick to it.) – i333 Oct 01 '14 at 15:53
  • Then you have the URL of the server you're accessing. Just prepend that. As for parsing with regex, it's just very unreliable, but I guess in small enough projects it might be okay, depending on how far you want to go with this. – casraf Oct 01 '14 at 23:42
  • Thanks for the suggestion. But can you suggest how I could resolve the links, maybe using a regex? – i333 Oct 02 '14 at 18:39
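
To illustrate the DOM-parser route mentioned in the comments, here is a minimal sketch that stays within PHP by using the bundled `DOMDocument` class rather than phpQuery. The helper name `extract_attribute` and its parameters are hypothetical. It returns the attribute values exactly as they appear in the markup, so relative URLs would still need to be resolved separately.

```php
<?php
// Hypothetical helper: collect one attribute from every matching tag.
function extract_attribute($html, $tag, $attribute)
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = array();
    foreach ($doc->getElementsByTagName($tag) as $node) {
        if ($node->hasAttribute($attribute)) {
            $values[] = $node->getAttribute($attribute);
        }
    }
    return $values;
}

// Example usage on a small, made-up page:
$html = '<html><head><link rel="stylesheet" href="/css/a.css">'
      . '<script src="js/app.js"></script></head>'
      . '<body><!-- <img src="/old.png"> --><img alt="x" src="/images/logo.png"></body></html>';

print_r(extract_attribute($html, 'img', 'src'));
```

Because the parser treats the commented-out `<img>` as a comment node, it is not returned, and attribute order or quoting style inside a tag does not matter. Note that for `link` tags this sketch returns every `href`, not only stylesheets, so a check on the `rel` attribute would be needed to narrow it down.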

0 Answers