PHP Parse URL - Domain returned as path when protocol prefix not present

Question

I am trying to parse URLs in PHP where the input could be any of the following:

Code:

$info = parse_url('http://www.domainname.com/');
print_r($info);

$info = parse_url('www.domain.com');
print_r($info);

$info = parse_url('/test/');
print_r($info);

$info = parse_url('test.php');
print_r($info);

Returns:

Array
(
    [scheme] => http
    [host] => www.domainname.com
    [path] => /
)
Array
(
    [path] => www.domain.com
)
Array
(
    [path] => /test/
)
Array
(
    [path] => test.php
)

As you can see, the problem is the second example, where the domain is returned as a path.

Taha Paksu · Accepted Answer · 2012-04-28T00:29:02.903

11

This gives the right results, but a file name needs to start with a slash:

parse('http://www.domainname.com/');
parse('www.domain.com');
parse('/test/');
parse("/file.php");

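// Prepend "http://" when the input has no scheme and is not a root-relative path.
function parse($url){
    if (strpos($url, "://") === false && substr($url, 0, 1) != "/") {
        $url = "http://" . $url;
    }
    $info = parse_url($url);
    // parse_url() returns false for seriously malformed URLs.
    if ($info) {
        print_r($info);
    }
}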

and the result is:

Array
(
    [scheme] => http
    [host] => www.domainname.com
    [path] => /
)
Array
(
    [scheme] => http
    [host] => www.domain.com
)
Array
(
    [path] => /test/
)
Array
(
    [path] => /file.php
)
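
The caveat about the leading slash can be seen with a small sketch. The helper name parse_parts() is hypothetical; it applies the same prefixing rule as parse() above but returns the parts instead of printing them:

```php
<?php
// Hypothetical variant of parse(): same prefixing rule, but returns the result.
function parse_parts(string $url)
{
    // No "://" and no leading "/": assume a bare domain and add a scheme.
    if (strpos($url, '://') === false && substr($url, 0, 1) !== '/') {
        $url = 'http://' . $url;
    }
    return parse_url($url);
}

// Without a leading slash, the file name is treated as a host.
print_r(parse_parts('test.php'));

// With a leading slash, it is treated as a path.
print_r(parse_parts('/test.php'));
```

The first call reports test.php under [host], which is why relative file names have to be given with a leading slash when using this approach.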

edited Apr 28 '12 at 00:29

answered Apr 28 '12 at 00:18


Just a quick one: how can I differentiate between a file name and a domain name so I know when to prepend the leading slash? – Matt Apr 28 '12 at 01:06
Check whether there's a www preceding it, but that may not be safe. Checking its extension, if you know all the possible file extensions, would be better. Counting the "."s won't be safe either. – Taha Paksu Apr 28 '12 at 01:08
Well, my code is scanning a page for links, so there's no guarantee the link will have www, a subdomain, or neither. Mammoth task if I need to check for all TLDs! – Matt Apr 28 '12 at 01:13

If you are fetching URLs from anchors in a web page, there are three possibilities: first, remote URLs, which start with "http://" (or "https://"); second, "relative to root" URLs, which always start with "/"; third, "relative to current path" URLs, which start directly with the path or file. You won't be running into "www.yourdomain.com" type URLs in anchors. – Taha Paksu Apr 28 '12 at 01:15
Two more possibilities: first, inline page anchors, which start with "#"; second, "javascript:" action hrefs. – Taha Paksu Jun 12 '20 at 06:43
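
The cases listed in the comments above can be sketched as a small classifier. The function name classify_href() and its return labels are hypothetical, and the "protocol-relative" case (hrefs starting with "//") is an extra assumption that the comments do not mention:

```php
<?php
// Hypothetical helper: labels an href by how it starts.
function classify_href(string $href): string
{
    $href = trim($href);
    if ($href === '') {
        return 'empty';
    }
    if ($href[0] === '#') {
        return 'fragment';            // inline page anchor
    }
    if (strncmp($href, '//', 2) === 0) {
        return 'protocol-relative';   // assumption: not covered in the comments
    }
    if ($href[0] === '/') {
        return 'root-relative';
    }
    // A scheme is a letter followed by letters, digits, "+", "-" or ".", then ":".
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $href, $m)) {
        $scheme = strtolower($m[1]);
        return in_array($scheme, ['http', 'https'], true) ? 'absolute' : 'other-scheme';
    }
    return 'path-relative';
}

$hrefs = ['http://www.domainname.com/', '/test/', 'test.php', '#top', 'javascript:void(0)'];
foreach ($hrefs as $href) {
    echo $href, ' => ', classify_href($href), PHP_EOL;
}
```

Under these rules a bare "www.domain.com" is labelled path-relative, which matches how a browser would resolve such an href.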

score 0 · Answer 2 · answered Nov 13 '17 at 06:15

To handle a URL in a way that preserves the fact that it was a scheme-less URL, whilst also allowing the domain to be identified, use the following code.

if (!preg_match('/^(([a-z][a-z0-9\-\.\+]*:)|(\/))/i', $url)) {
    $url = '//' . $url;
}

This prepends "//" to the URL only if the URL does not have a valid scheme and does not begin with "/".
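
As a minimal sketch, the check can be wrapped in a function and fed to parse_url(). The name with_authority_marker() is hypothetical. The pattern here puts both alternatives inside one group under a single "^", so that a "/" later in the URL (as in www.domain.com/some/page) does not count as a leading slash:

```php
<?php
// Hypothetical wrapper: prepends "//" when the URL has no scheme and no leading "/".
function with_authority_marker(string $url): string
{
    // "^" applies to the whole group: either a scheme followed by ":", or a leading "/".
    if (!preg_match('/^(?:[a-z][a-z0-9+.\-]*:|\/)/i', $url)) {
        $url = '//' . $url;
    }
    return $url;
}

// The host is identified, and no scheme is invented for the URL.
$info = parse_url(with_authority_marker('www.domain.com/some/page'));
print_r($info);
```

Because no scheme is added, the caller can still tell afterwards that the original input was scheme-less by checking whether the scheme key is set.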

Some quick background on this:

The parser assumes that the (valid) characters before ":" are the scheme, whilst the characters following "//" are the domain. To indicate that the URL has both a scheme and a domain, the two markers must be used consecutively, as "://". For example:

[scheme]:[path//path]
//[domain][/path]
[scheme]://[domain][/path]
[/path]
[path]

This is how PHP parses URLs with parse_url(), but I couldn't say whether it conforms to the standard.
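
The forms above can be checked with a short script that prints which components parse_url() reports for each one. The sample URLs are illustrative; mailto:someone@example.com stands in for the [scheme]:[path] form:

```php
<?php
// One sample input per form listed above.
$examples = [
    'mailto:someone@example.com',  // [scheme]:[path]
    '//www.domain.com/test/',      // //[domain][/path]
    'http://www.domainname.com/',  // [scheme]://[domain][/path]
    '/test/',                      // [/path]
    'test.php',                    // [path]
];

foreach ($examples as $url) {
    $parts = parse_url($url);
    // List only the component names that parse_url() returned.
    echo $url, ' => ', implode(', ', array_keys($parts)), PHP_EOL;
}
```

Only the inputs containing "//" yield a host component; everything else comes back as a path, with or without a scheme.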

The rule for a valid scheme name is: alpha *( alpha | digit | "+" | "-" | "." )

@Shardj I'm afraid I can't replicate the error you have reported. Perhaps double-check that you have copied the expressions correctly. I suspect you have `(/)` in the expression instead of `(\/)`. – Courtney Miles Jul 29 '20 at 22:37
