4

I am trying to parse URLs in PHP where the input could be any of the following:

Code:

$info = parse_url('http://www.domainname.com/');
print_r($info);

$info = parse_url('www.domain.com');
print_r($info);

$info = parse_url('/test/');
print_r($info);

$info = parse_url('test.php');
print_r($info);

Returns:

Array
(
    [scheme] => http
    [host] => www.domainname.com
    [path] => /
)
Array
(
    [path] => www.domain.com
)
Array
(
    [path] => /test/
)
Array
(
    [path] => test.php
)

The problem is the second example, where the domain is returned as a path instead of a host.

Cœur
Matt

2 Answers

11

This gives the right results, but a file path needs to start with a slash:

parse('http://www.domainname.com/');
parse('www.domain.com');
parse('/test/');
parse("/file.php");

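function parse($url){
    // Prepend a scheme unless the URL already has one or starts with a slash.
    if (strpos($url, "://") === false && substr($url, 0, 1) != "/") {
        $url = "http://" . $url;
    }
    $info = parse_url($url);
    // parse_url() returns false for seriously malformed URLs.
    if ($info) {
        print_r($info);
    }
}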

The result is:

Array
(
    [scheme] => http
    [host] => www.domainname.com
    [path] => /
)
Array
(
    [scheme] => http
    [host] => www.domain.com
)
Array
(
    [path] => /test/
)
Array
(
    [path] => /file.php
)
Taha Paksu
  • Just a quick one: how can I differentiate between a file name and a domain name so I know when to prepend the leading slash? – Matt Apr 28 '12 at 01:06
  • Check whether it starts with "www", though that may not be safe. Checking its extension, if you know all the possible file extensions, would be better. Counting the "." characters won't be safe either. – Taha Paksu Apr 28 '12 at 01:08
  • Well, my code is scanning a page for links, so there's no guarantee the link will have www, a subdomain, or neither. It would be a mammoth task if I need to check for all TLDs! – Matt Apr 28 '12 at 01:13
  • If you are fetching URLs from anchors in a web page, there are three possibilities: first, remote URLs, which start with a scheme such as "http://"; second, "relative to root" URLs, which always start with "/"; third, "relative to current path" URLs, which start directly with the path or file. You won't be running into "www.yourdomain.com" type URLs in anchors. – Taha Paksu Apr 28 '12 at 01:15
  • Two more possibilities: first, inline page anchors, which start with "#"; second, "javascript:" action hrefs. – Taha Paksu Jun 12 '20 at 06:43
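
The cases listed in the comments above can be sketched as a small classifier. This is only an illustration: the function name classifyHref and its return labels are hypothetical, and the "//host" case is taken from the scheme-less form discussed in the other answer rather than from these comments.

```php
<?php
// Hypothetical helper (not from the thread): classify an href taken from an anchor tag.
function classifyHref(string $href): string
{
    $href = trim($href);
    if ($href === '') {
        return 'empty';
    }
    if ($href[0] === '#') {
        return 'fragment';           // inline page anchor, e.g. "#top"
    }
    if (strpos($href, '//') === 0) {
        return 'scheme-relative';    // "//host/path", no scheme given
    }
    if ($href[0] === '/') {
        return 'root-relative';      // "/test/"
    }
    // A leading scheme: a letter followed by letters, digits, "+", "-" or ".", then ":".
    if (preg_match('#^([a-z][a-z0-9+.\-]*):#i', $href, $m)) {
        $scheme = strtolower($m[1]);
        return in_array($scheme, ['http', 'https'], true) ? 'remote' : 'other-scheme';
    }
    return 'relative';               // "test.php", "dir/page.php"
}
```

With this sketch, a bare "www.domain.com" is classified as 'relative', which matches how an href without a scheme is normally resolved against the current page.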
0

To handle a URL in a way that preserves the fact that it was scheme-less, whilst still allowing the domain to be identified, use the following code.

if (!preg_match('/^(?:[a-z][a-z0-9+.\-]*:|\/)/i', $url)) {
    $url = '//' . $url;
}

This prepends "//" to the URL only if it does not start with a valid scheme and does not begin with "/".
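
As a rough sketch, the check can be wrapped in a function and fed the inputs from the question. The name parseLooseUrl is hypothetical, and the pattern here groups both alternatives so that "^" applies to the scheme test and to the leading-slash test alike.

```php
<?php
// Hypothetical wrapper: prepend "//" to scheme-less input, then call parse_url().
function parseLooseUrl(string $url)
{
    // Leave the URL alone if it starts with a scheme ("http:", "mailto:") or with "/".
    if (!preg_match('/^(?:[a-z][a-z0-9+.\-]*:|\/)/i', $url)) {
        $url = '//' . $url;
    }
    return parse_url($url);
}

print_r(parseLooseUrl('www.domain.com'));   // the domain is now reported under [host]
print_r(parseLooseUrl('/test/'));           // still reported under [path]
```

Note that a bare file name such as "test.php" also gets the "//" prefix and is therefore reported as a host, so this approach only suits input where a leading name is expected to be a domain.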

Some quick background on this:

The parser assumes that the (valid) characters before ":" are the scheme, whilst the characters following "//" are the domain. To indicate that the URL has both a scheme and a domain, the two markers must be used consecutively, as "://". For example:

  • [scheme]:[path//path]
  • //[domain][/path]
  • [scheme]://[domain][/path]
  • [/path]
  • [path]

This is how PHP parses URLs with parse_url(), but I couldn't say whether it conforms to the standard.
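
A quick way to check the forms listed above is to run one sample of each through parse_url(). The sample URLs below are made up for illustration, and the variable names are arbitrary.

```php
<?php
// One illustrative input per form from the list above.
$examples = [
    'mailto:user@example.com',       // [scheme]:[path]
    '//www.domain.com/test/',        // //[domain][/path]
    'http://www.domain.com/test/',   // [scheme]://[domain][/path]
    '/test/',                        // [/path]
    'test.php',                      // [path]
];

$results = [];
foreach ($examples as $url) {
    $results[$url] = parse_url($url);
    // Print each input next to the components parse_url() found.
    echo $url, ' => ', json_encode($results[$url]), PHP_EOL;
}
```

Only the second and third forms produce a host component; the others come back with a path and, for the first, a scheme.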

The rule for a valid scheme name is: alpha *( alpha | digit | "+" | "-" | "." )

Courtney Miles
  • preg_match(): Unknown modifier ')' – Shardj Jul 29 '20 at 10:24
  • @Shardj I'm afraid I can't replicate the error you have reported. Perhaps double check you have copied the expressions correctly. I suspect you have `(/)` in the expression instead of `(\/)`. – Courtney Miles Jul 29 '20 at 22:37