I am using a script to check the links on a given page, with Simple HTML DOM parsing the page into an array. I need to check the href of every a tag to find out whether it points to a file or contains something like # or JavaScript.

I tried the following without success.

if(preg_match("|^(.*)|iU", $href)){
    save_link();
}

I don't know if my pattern is wrong or if there is a better method to do this.

I want to detect whether $href contains an extension such as .com or .php. That would filter out values like # and "function()" and other non-link items used in the href attribute.

EDIT: parse_url will not work, so please stop posting it. As I stated above, the value # comes back as a valid URL. I am trying to match any string followed by a . with no more than 4 characters after the dot.

James
  • paste some example hrefs.. did they start with file:// ? – Nelson Oct 12 '12 at 19:43
  • href="#" href="function()" href="http://www.site.com/file.php" – James Oct 12 '12 at 19:44
  • @James: `href` should not be `function()`. If you want JS code in there, use `href='javascript:jscode();'`, but it is better not to put it there at all (use an event handler instead). – Spudley Oct 12 '12 at 19:46
  • @James: for `site.com/file`, you should have `http://` on the front of that if it's a URL. – Spudley Oct 12 '12 at 19:47
  • I am creating a system that checks other sites, so I have to be prepared for anything someone may enter. – James Oct 12 '12 at 19:55
  • [Would this do?](http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) – BudwiseЯ Oct 12 '12 at 20:04

3 Answers

You can use parse_url(), like this:

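$res = parse_url($href);
// parse_url() returns false on seriously malformed input, and the
// 'scheme' key is absent for values such as "#" or "function()"
$scheme = (is_array($res) && isset($res['scheme'])) ? strtolower($res['scheme']) : '';
if ($scheme === 'http' || $scheme === 'https') {
    // valid http(s) URL
    save_link();
}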

UPDATE:
I've added code to accept only http and https URLs, thanks to Baba for spotting this.
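
As a rough check of the scheme filter above, the sketch below wraps it in a function and runs it over the hrefs mentioned in the question and comments. The name `is_http_link` is a hypothetical helper introduced for illustration, not part of the original answer, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (an assumption, not from the answer):
// returns true only when the href has an http or https scheme.
function is_http_link(string $href): bool
{
    $parts = parse_url(trim($href));
    if (!is_array($parts) || !isset($parts['scheme'])) {
        // "#", "function()" and "sdsd" carry no scheme at all
        return false;
    }
    return in_array(strtolower($parts['scheme']), ['http', 'https'], true);
}

// Sample hrefs taken from the question and the comments.
$hrefs = ['#', 'function()', "javascript:alert('ok')", 'sdsd', 'http://www.site.com/file.php'];

// Keep only the entries that pass the filter.
$links = array_values(array_filter($hrefs, 'is_http_link'));
```

Checking the scheme against a short whitelist, rather than trying to blacklist `#` and `javascript:`, means anything unexpected is dropped by default.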

Nelson
  • `parse_url("htt#ps://google.com");` or `parse_url("sdsd");` would be valid with this solution see : http://codepad.viper-7.com/KIDl8N – Baba Oct 12 '12 at 19:49
  • @Baba so in that case "htt#ps://google.com" is a valid composed url. I have no problem with that. Remember you can have custom protocols, like elink: or magnet: , it seems the # is a valid character for a protocol. – Nelson Oct 12 '12 at 19:53
  • `htt#ps://google.com` is valid. It means file name `htt`, and anchor point within that file `ps://google.com`. This is valid. It may be unlikely to be what was intended, but it is valid. – Spudley Oct 12 '12 at 19:57

I believe that the function you're looking for is parse_url().

This function will take a URL string, and return an array of components, which will allow you to work out what kind of URL it is.

Note, however, that it has issues with incomplete URLs in PHP versions prior to 5.4.7, so use a recent PHP version to get the best out of it.
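
To illustrate how the returned components can be used to work out what kind of href you have, here is a minimal sketch. The function `classify_href` and its labels are hypothetical names chosen for this example, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (an assumption for illustration): label an href
// using the components that parse_url() reports.
function classify_href(string $href): string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return 'fragment';      // "#", "#top": same-page anchor or empty
    }

    $parts = parse_url($href);
    if ($parts === false) {
        return 'malformed';     // parse_url() gave up on the string
    }

    if (isset($parts['scheme'])) {
        $scheme = strtolower($parts['scheme']);
        return in_array($scheme, ['http', 'https'], true) ? 'web' : 'other-scheme';
    }

    return 'no-scheme';         // relative path, or junk such as "function()"
}
```

With the examples from the question, `http://www.site.com/file.php` is labelled `web`, `#` is labelled `fragment`, and `function()` falls into `no-scheme`, so the caller can decide which labels are worth saving.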

Hope that helps.

Spudley
  • Don't think that would work. Try `parse_url("javascript:alert('ok')")`, see: http://codepad.viper-7.com/KIDl8N – Baba Oct 12 '12 at 19:52
  • @Baba: what's wrong with that? it's told you that `javascript` is the protocol. That's all you need to know. – Spudley Oct 12 '12 at 19:58
  • see question `find if they contain a file or something like # or JS` .. he does not want javascript – Baba Oct 12 '12 at 19:59
  • @Baba - uh... so if `parse_url()` says it's not http or https, then throw it away. The point is that `parse_url()` tells you about the string and what it contains; you can then use that information to decide what to do with it, so if you don't want a JS or a # fragment, you can drop it. But the question is asking how to get that information, and `parse_url()` is the way to do it. You need to do more than that one line to handle all possibilities, but that is where you need to start, and that's what he was asking for. – Spudley Oct 12 '12 at 20:06
  • I agree with @Spudley. You can test the scheme to see if it's javascript. At this point you only want http/https. – Joshua Kaiser Oct 12 '12 at 20:11

See http://php.net/manual/en/function.parse-url.php

I'm assuming you don't want to match fragments (#) because you are not concerned with following internal anchors.

parse_url breaks the URL up into its different parts and returns them as an array. The path component of the URL is in this array, and you can run your check against that.
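
A minimal sketch of that path check, assuming the rule from the question (a dot followed by no more than 4 characters) and PHP 7 or later. The name `has_file_extension` is hypothetical, and restricting the extension to letters and digits is an assumption of this example.

```php
<?php
// Hypothetical helper (an assumption for illustration): true when the
// path component of $href ends in ".ext" with 1 to 4 letters or digits.
function has_file_extension(string $href): bool
{
    // Ask parse_url() for the path only; it is null when there is none.
    $path = parse_url($href, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return false;
    }

    // pathinfo() returns '' when the last path segment has no dot.
    $ext = pathinfo($path, PATHINFO_EXTENSION);

    return preg_match('/^[A-Za-z0-9]{1,4}$/', $ext) === 1;
}
```

This only looks at the path, so a bare `http://www.site.com` has no extension by this test even though the host ends in `.com`. Combining it with a check on the scheme covers that case.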

Joshua Kaiser