Why do PHP filters malfunction?

Q1. This is surely an invalid URL: http://gvdvb.com/?mail=ali&bus=wertdomainn#wronghttp://gvdvb.com/?mail=ali&bus=wert(

The final character makes it invalid according to this URL regex: https://regexr.com/39nr7

So how come PHP validates this invalid URL? I get no error telling me that the URL I typed into the form's URL field is invalid.

URL filter:

<form method="POST" name="textfield" id="textfield" action="">
<fieldset>
<label for="url">Url:</label><br>
<input type="text" name="url" id="domain" maxlength="255" size="20">
<br>
<button type="submit">Submit Now!</button>
</fieldset>
</form>

<?php


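// Only handle submitted forms.
if ($_SERVER['REQUEST_METHOD'] === 'POST')
{
    if (isset($_POST['url']))
    {
        // filter_input() returns false when the URL filter rejects the value.
        if (!filter_input(INPUT_POST, 'url', FILTER_VALIDATE_URL))
        {
            // Escape user input before echoing it back into the page.
            die('Invalid Url: ' . htmlspecialchars($_POST['url']));
        }
        else
        {
            $url = $_POST['url'];
            $parse = parse_url($url);
            $domain = $parse['host'];

            // Run the host part through the domain filter.
            if (!filter_var($domain, FILTER_VALIDATE_DOMAIN))
            {
                die('Invalid Domain: ' . htmlspecialchars($domain));
            }
            echo 'Valid Domain: ' . htmlspecialchars($domain);
        }
    }
}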

Note that in the HTML form I declared the URL input field as:

<label for="url">Url:</label><br>
<input type="text" 

I did this deliberately so that HTML5 validation doesn't raise the error. I want PHP to raise it instead, since I need to learn how to filter URLs with PHP.

Q2. Why doesn't PHP's domain filter work? Whenever I input an invalid URL with an invalid domain such as: xxx

I get this echoed: Valid Domain: LINE: 27

<form method="POST" name="textfield" id="textfield" action="">
<fieldset>
<label for="url">Url:</label><br>
<input type="text" name="url" id="domain" maxlength="255" size="20">
<br>
<button type="submit">Submit Now!</button>
</fieldset></form>

<?php


if($_SERVER['REQUEST_METHOD'] === 'POST')
{
    if(ISSET($_POST['url']))
    {
        $url = $_POST['url'];
        $domain = parse_url($url,PHP_URL_HOST);
        
        if(!filter_input(INPUT_POST,'url',FILTER_VALIDATE_DOMAIN))
        {
            echo "Invalid Domain: $domain"; echo '<br>';
            echo 'LINE: ' . __LINE__;
        }
        else
        {
            echo 'Valid Domain: ' .$domain;
            echo 'LINE: ' . __LINE__;
        }
    }
}

Very puzzling!

  • What makes you think that the regex is correct and PHP's validate is wrong? The `(` comes after a `#`, so it isn't really a part of the actual URL. It's an anchor point for the browser (and won't even be sent to the server) – M. Eriksson Sep 13 '21 at 22:10
  • @Magnus Eriksson, Since it's a third party site gone online, it must be correct. – Nut Cracking Dude Sep 13 '21 at 22:13
  • I'm sorry, but that comment makes no sense. The regex you linked to is someone's regex snippet they've shared using regexr.com. I can write whatever regex there and share it. It won't mean it's automatically correct. I would trust PHP's validate filter way more than some random regex found online made by some unknown person. – M. Eriksson Sep 13 '21 at 22:16
  • @MagnusEriksson It's on the Internet, it MUST be true. – Barmar Sep 13 '21 at 22:19
  • @Barmar - Ofc... stupid me! :-) – M. Eriksson Sep 13 '21 at 22:19
  • @Magnus Eriksson, OK then tell me, is the following URL valid or not? According to me it is not, but I get echoed that it's a valid URL and a valid domain. URL: http://gvdvb.comptcu/?mail=ali&bus=wertdomainn(( – Nut Cracking Dude Sep 13 '21 at 22:19
  • There are many different URL regular expressions, depending on whether you're trying to validate a URL or find it mixed into text. The latter tend to be more conservative about what they recognize in a URL. – Barmar Sep 13 '21 at 22:20
  • @Barmar, You telling me this is a valid domain ? gvdvb.comptcu – Nut Cracking Dude Sep 13 '21 at 22:21
  • Yes, it's valid to use parentheses in the query string: https://stackoverflow.com/a/13227416/2453432 – M. Eriksson Sep 13 '21 at 22:21
  • For instance, the matching regexp doesn't allow most punctuation, because you might write `(http://www.example.com)` and the `()` should not be included. – Barmar Sep 13 '21 at 22:21
  • It's not checking whether the domain exists, it's just checking syntax. There's nothing preventing a `.comptcu` domain from existing. – Barmar Sep 13 '21 at 22:22
  • @Barmar, I did not write any regex on my code to test this url which I deem invalid: gvdvb.comptcu. I just used php's filter: filter_var($domain,FILTER_VALIDATE_DOMAIN). – Nut Cracking Dude Sep 13 '21 at 22:23
  • @Barmar, I get your point about it checking syntax, but shouldn't PHP's interpreter have a list of the valid TLDs currently in existence so that it doesn't validate non-existent TLDs? So all the PHP vendors did was get PHP to check if the url contains a dot or not and some chars or not before and after the dot, and if it does then the syntax is OK and so it's a valid domain even if the TLD is non-existent? I call this downright silly! So now how do I get PHP to verify whether the TLDs exist or not? Can you show me some code to enlighten me? – Nut Cracking Dude Sep 13 '21 at 22:27
  • But you referred to a regexp at regexr.com. – Barmar Sep 13 '21 at 22:27
  • There are new TLD's added all the time. It wouldn't be feasible for PHP to keep an updated list. Plus, there are more TLD's floating around than the official ones. You can create a network with your own TLD's. How would PHP be able to handle those? PHP (and any regex) will only check for the correct formats. – M. Eriksson Sep 13 '21 at 22:29
  • These filters are just checking syntax, not whether the domain or URL actually exists. – Barmar Sep 13 '21 at 22:29
  • @Magnus, You are right for us not to trust some third party url filter. But I thought it was built by some pros like yourself. Anyway, I'll stick to your advice. Now, how-about answering my latest comments to Barmar since he's gone quiet ? – Nut Cracking Dude Sep 13 '21 at 22:29
  • @Magnus Eriksson, I get your point. Yes, I am aware of the alternative name spaces. Ever since 2002. Anyway, do you mind editing my code, as I'd like to learn your style and get my code to give me an alert when the domain is wrong? Let's start off by adding all the TLDs in an array and then checking the inputted URL against the TLD array to begin with. Now how to write this code? I've got no clue. Care to share? Let's start off by adding only .com, .net & .org in the array and let's test your upcoming code. – Nut Cracking Dude Sep 13 '21 at 22:33
  • There are strict rules for what is considered a URL and not. However, it's _way_ more complicated than _"check if the url contains a dot or not and some chars or not before and after the dot"_. You need to read up on what RFC's are, then you can search for any RFC regarding URL's to learn how complicated it actually is. – M. Eriksson Sep 13 '21 at 22:33
  • I'm sorry, but I'm not here to write your code for you. Do some actual research and make some attempts. If you run into some specific issue along the way, come back, show us what you've tried, the expected result and what you currently get. And I wasn't talking about namespaces, I was talking about TLD's (top level domains, such as .com, .net, .org etc) – M. Eriksson Sep 13 '21 at 22:35
  • If the php vendor had taught all the rfc stuffs to php's default filter then there would be no need for us to write a custom function that abides by the RFC. At least agree with me here. – Nut Cracking Dude Sep 13 '21 at 22:36
  • @Barmar, Tell me at least this. Is this line correct or not? if(!filter_var($domain,FILTER_VALIDATE_DOMAIN)). Yes or no? Or should I re-write it to: if(filter_var($domain,FILTER_VALIDATE_DOMAIN) ===FALSE). Note the 3 equal signs here. – Nut Cracking Dude Sep 13 '21 at 22:38
  • _"If the php vendor had taught all the rfc stuffs to php's default filter"_ - What RFC are you missing? I honestly don't understand what you're ranting about here. You have gotten your answer. Now go do some actual research into the matter. I'm out. – M. Eriksson Sep 13 '21 at 22:39
  • @Magnus Eriksson, I thought the RFCs would have a list of all currently valid TLDs, and so I said that if the PHP vendor stuck to the RFC guideline then they would have built the TLD list into their default DOMAIN FILTER. But now I understand that the RFCs don't have that list. All they say is something like there must be a dot and some chars at the right and at least one char at the left, and in total there must not be more than 255 chars for the domain to be counted as a valid domain. Something like that. I get your point now. – Nut Cracking Dude Sep 13 '21 at 22:44
  • Please follow my advice and read up on what RFC's are and what they actually say. You seem to have a lot of misconceptions about multiple things. – M. Eriksson Sep 13 '21 at 22:45
  • @Magnus Eriksson, And if we need our PHP script to only validate existing TLDs then we must write our own custom function. I was hoping you'd write a mini one for my learning purpose and others' learning purposes. I'm still a beginner student. Still not at OOP or PDO yet. Still struggling at procedural style. – Nut Cracking Dude Sep 13 '21 at 22:46
  • @Magnus Eriksson, I ain't bothered digging into countless lines of RFC text and doing my head in. Got to come to some shortcut workaround without reading tonnes of RFC guidelines. – Nut Cracking Dude Sep 13 '21 at 22:47
  • @Magnus Eriksson, I get the vibe now that with PHP's default filter, there's no way we can validate a URL, let alone a domain. Tell me at least this, is my code alright to validate URL and domain? – Nut Cracking Dude Sep 13 '21 at 22:49
  • @NutCrackingDude The only difference between the two ways to check is if an empty string could be a valid value. I don't think that's the case for any of the validations, so `!filter_var(...)` should be fine. – Barmar Sep 13 '21 at 22:57

2 Answers

These filters just check whether the domain or URL is valid syntactically, according to the relevant standards.

FILTER_VALIDATE_DOMAIN checks that it's valid according to RFC 1034, RFC 1035, RFC 952, RFC 1123, RFC 2732, and RFC 2181.

FILTER_VALIDATE_URL checks that it's valid according to RFC 2396.

They don't check that the domain actually exists, that the URL actually has a webserver, that the parameters are appropriate for the URL, etc.

So you can't use them to tell whether the URL is actually useful, just whether it looks like a URL.
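To make the syntax-only behaviour concrete, here is a minimal sketch (assuming PHP 7 or later) that runs the inputs from the question and comments through both filters. The helper name `looks_like_hostname` is hypothetical, introduced only for this illustration. `FILTER_FLAG_HOSTNAME` is a real flag that tightens FILTER_VALIDATE_DOMAIN to hostname characters; without it the filter is far more lenient, and even with it no TLD list is consulted.

```php
<?php
// Syntax-only checks: none of these calls touch the network.

// Hypothetical helper: true when $host is a syntactically valid hostname.
function looks_like_hostname(string $host): bool
{
    // FILTER_FLAG_HOSTNAME restricts labels to letters, digits and hyphens.
    return filter_var($host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME) !== false;
}

$url = 'http://gvdvb.comptcu/?mail=ali&bus=wertdomainn((';

// Parentheses are legal in a query string, so the URL passes.
var_dump(filter_var($url, FILTER_VALIDATE_URL) !== false);   // bool(true)

// A made-up TLD is still well-formed, so the hostname passes too.
var_dump(looks_like_hostname('gvdvb.comptcu'));              // bool(true)
var_dump(looks_like_hostname('xxx'));                        // bool(true)

// An underscore is not a hostname character, so this one fails.
var_dump(looks_like_hostname('exa_mple.com'));               // bool(false)
```

Note that the second snippet in the question passes the whole submitted string to FILTER_VALIDATE_DOMAIN rather than the parsed host, and `parse_url('xxx', PHP_URL_HOST)` returns null because a bare word is parsed as a path. That would explain why "Valid Domain:" is echoed with nothing after it.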

Barmar

Let's break down your sample URI:

http://gvdvb.com/?mail=ali&bus=wertdomainn#wronghttp://gvdvb.com/?mail=ali&bus=wert(

The protocol: http://

The domain: gvdvb.com

The query string: mail=ali&bus=wertdomainn

And finally, the page anchor: wronghttp://gvdvb.com/?mail=ali&bus=wert(

While I personally would never use an anchor like that, it is still OK.
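The same breakdown can be reproduced with PHP's own `parse_url()`, which splits a string into components without validating them. This is only a sketch to show where each part of the sample URI ends up; the variable names are arbitrary.

```php
<?php
$uri = 'http://gvdvb.com/?mail=ali&bus=wertdomainn#wronghttp://gvdvb.com/?mail=ali&bus=wert(';

// parse_url() returns an associative array of the components it finds.
$parts = parse_url($uri);

echo $parts['scheme'], "\n";    // http
echo $parts['host'], "\n";      // gvdvb.com
echo $parts['path'], "\n";      // /
echo $parts['query'], "\n";     // mail=ali&bus=wertdomainn
echo $parts['fragment'], "\n";  // wronghttp://gvdvb.com/?mail=ali&bus=wert(
```

Everything after the first `#` lands in the fragment, including the second `http://` and the trailing `(`.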

The PHP filter_var() for FILTER_VALIDATE_URL is based on RFC 2396, Uniform Resource Identifiers (URI).

These rules allow for private domains to be constructed (something I often use on internal or local-only networks) as well as the expansion of the public domains without the need to rewrite the rules whenever there is a change.

The flexibility of the RFC is more important than a "fixed" rule. This does mean the filter can only confirm that a domain is valid as in formatted correctly, not valid as in exists.

If you want a "valid as in exists" rule, you will need to write your own lookup query.
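One possible shape for such a lookup is sketched below, using PHP's built-in `checkdnsrr()` to ask DNS for address records. The function name `domain_resolves` and its injectable `$lookup` parameter are assumptions made for this example (the parameter exists only so the logic can be tried without a network); treat it as a starting point rather than a complete existence check.

```php
<?php
// Hypothetical helper: does $host have at least one address record in DNS?
// $lookup defaults to PHP's checkdnsrr(); a stub can be passed for testing.
function domain_resolves(string $host, ?callable $lookup = null): bool
{
    $lookup = $lookup ?? 'checkdnsrr';

    // Reject anything that is not even a well-formed hostname.
    if (filter_var($host, FILTER_VALIDATE_DOMAIN, FILTER_FLAG_HOSTNAME) === false) {
        return false;
    }

    // Try IPv4 (A) records first, then IPv6 (AAAA).
    return $lookup($host, 'A') || $lookup($host, 'AAAA');
}

// Stubbed lookup that pretends nothing resolves, to show the call shape.
$neverResolves = function (string $host, string $type): bool {
    return false;
};
var_dump(domain_resolves('gvdvb.comptcu', $neverResolves));  // bool(false)

// Live lookup; the result depends on DNS at the time of the call.
// var_dump(domain_resolves('gvdvb.com'));
```

A positive result only says the name resolved at that moment; it says nothing about whether a web server answers at that address.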

As to your code, yes, the PHP used will check if the format of the input is a valid URL, but not if the URL is publicly accessible.

And let's think about it, domains come and go all the time. If you wanted to check if a URL is accessible today, are you going to check again tomorrow to make sure it is still accessible?

Tigger
  • xxx doesn't seem a valid domain to me because it has no tld. And so, I was expecting php's domain filter to flag this as invalid after checking the syntax and not checking if the domain exists or not. That's all. – Nut Cracking Dude Sep 21 '21 at 13:33
  • @NutCrackingDude [.xxx](https://en.wikipedia.org/wiki/.xxx) is a valid TLD, but the point remains. The RFC is written in a way to be flexible and expandable. PHP is correct to support the RFC and not "check" if a TLD or even a URL is available or accessible. You could write your own tool to do something like [`nslookup`](https://en.wikipedia.org/wiki/Nslookup) or [`dig`](https://en.wikipedia.org/wiki/Dig_(command)) or use one of the [PHP system level functions](https://www.php.net/manual/en/ref.exec.php) to access either of these DNS commands. – Tigger Sep 21 '21 at 22:20
  • I forgot to accept your answer the other day and so accepted it now. – Nut Cracking Dude Oct 04 '21 at 16:23