
I have a form that has fields for a couple of URLs. I wrote a Zend Framework validator that does a trivial preg_match to screen out ridiculous strings, and then does a curl HEAD request (CURLOPT_NOBODY) to screen out 404s and other connectivity issues. In testing I came across the mysterious return code 0 with "unknown SSL protocol error", so I added a check to accept as valid anything whose error message contains "SSL", since that would suggest that the URL reached a webserver.

But one particular URL that our customers would likely use in practice redirects to an s3.amazonaws.com URL for a PDF file. In a browser, both the original URL, and the s3 URL it redirects to, display the PDF just fine. Since I used CURLOPT_FOLLOWLOCATION, I expected my validator would accept it. But instead it gave a 404. I then tried specifying the s3 URL directly, and that gave a 403(!). Thinking that possibly the 403 was triggered by the fact that I had specified a header of 'HTTP_X_REQUESTED_WITH: XMLHttpRequest', I commented out that line in the code. But it still gave a 403.

How can this happen? It seems to me that Amazon S3 would have to look for HEAD requests explicitly, and deliberately issue a 404 or 403 depending on whether the request came via a redirect?

I suppose I could delete the CURLOPT_NOBODY to have it send a GET request, but that seems silly since I don't care about the body.
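
For what it's worth, a rough, untested sketch of that alternative would be a GET capped with CURLOPT_RANGE, so only the first byte of the body is transferred (assuming the server honors Range headers, which isn't guaranteed; the URL below is just a placeholder):

<?php
// Untested sketch: a GET instead of a HEAD, with the transfer capped at one byte.
$ch = curl_init('https://example.com/some.pdf'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RANGE, '0-0');          // request only the first byte
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 206 if Range was honored, 200 otherwise
curl_close($ch);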

Here is my complete code:

<?php

class Oshk_ZendX_Validate_Url {
    static $debug = true;
    // Based on https://stackoverflow.com/a/42619410/467590
    const PATTERN = '/^(https?:\/\/)?[^" ]+(\.[^" ]+)*$/';

    public static function isValid($value) {
        $STDERR = fopen("php://stderr", "w");
        $value = (string) $value;
        $matches = array();
        if (! preg_match(self::PATTERN, $value, $matches)) {
            fwrite($STDERR, sprintf("File '%s', line %d, value '%s' does not match pattern '%s'\n", __FILE__, __LINE__, $value, self::PATTERN));
            fclose($STDERR);
            return false;
        }
        if (! array_key_exists(1, $matches)) {
            $value = "https://$value";
        }
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, \$value = '%s', \$matches = %s", __FILE__, __LINE__, $value, print_r($matches, true)));
        }
        // URL looks well-formed. Ask curl to send a HEAD request to it
        $ch = curl_init($value);
        if ($ch === false) {
            throw new Exception("curl_init($value) failed!");
        }
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, 0); // From https://www.php.net/manual/en/curl.examples-basic.php
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('HTTP_X_REQUESTED_WITH: XMLHttpRequest'));
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36');
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        if (self::$debug) {
            curl_setopt($ch, CURLOPT_VERBOSE, true);
            curl_setopt($ch, CURLOPT_STDERR, $STDERR);
        }
        $data = curl_exec($ch);
        $msg = curl_error($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if (self::$debug) {
            // https://stackoverflow.com/a/14436877/467590
            $allinfo = curl_getinfo($ch);
            fwrite($STDERR, sprintf("File '%s', line %d, \$allinfo = %s\n", __FILE__, __LINE__, print_r($allinfo, true)));
        }
        curl_close($ch);
        if (self::$debug) {
            fwrite($STDERR,  sprintf("File '%s', line %d, data = '%s'\n", __FILE__, __LINE__, substr($data, 0, 255)));
        }
        if(! strlen($data) && $status != 0 && false === strpos($msg, 'SSL')) {
            fwrite($STDERR, sprintf("File '%s', line %d, '%s' gives bad status code %d when accessed, with message '%s'\n", __FILE__, __LINE__, $value, $status, $msg));
            fclose($STDERR);
            return false;
        }
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, url = '%s'\n", __FILE__, __LINE__, $value));
            fwrite($STDERR, sprintf("File '%s', line %d, data = '%s'\n", __FILE__, __LINE__, substr($data, 0, 255)));
        }
        unset($data);
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, \$msg = '%s'\n", __FILE__, __LINE__, $msg));
            fwrite($STDERR, sprintf("File '%s', line %d, \$status = '%s'\n", __FILE__, __LINE__, $status));
            fwrite($STDERR, sprintf("File '%s', line %d, \$value = '%s'\n", __FILE__, __LINE__, $value));
        }
        if (($status >= 100 && $status < 400) || false !== strpos($msg, 'SSL')) {
            fclose($STDERR);
            return true;
        }
        fwrite($STDERR, sprintf("File '%s', line %d, '%s' gives bad status code %d when accessed, with message '%s'\n", __FILE__, __LINE__, $value, $status, $msg));
        fclose($STDERR);
        return false;
    }
}

var_dump(Oshk_ZendX_Validate_Url::isValid($argv[1]));

Here is the bash shell session running it with the original URL:

$ php curltest.php 'https://americandrivingsociety.org/docs.ashx?id=1037680'
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 21, $value = 'https://americandrivingsociety.org/docs.ashx?id=1037680', $matches = Array
(
        [0] => https://americandrivingsociety.org/docs.ashx?id=1037680
        [1] => https://
)
*   Trying 208.66.171.71:443...
* Connected to americandrivingsociety.org (208.66.171.71) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: \xampp7412\apache\bin\curl-ca-bundle.crt
    CApath: none
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=americandrivingsociety.org
*  start date: Sep  2 00:00:00 2022 GMT
*  expire date: Oct  3 23:59:59 2023 GMT
*  subjectAltName: host "americandrivingsociety.org" matched cert's "americandrivingsociety.org"
*  issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
*  SSL certificate verify ok.
> HEAD /docs.ashx?id=1037680 HTTP/1.1
Host: americandrivingsociety.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
Accept: */*
HTTP_X_REQUESTED_WITH: XMLHttpRequest

* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
* The requested URL returned error: 404 Not Found
* Closing connection 0
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 46, $allinfo = Array
(
        [url] => https://americandrivingsociety.org/docs.ashx?id=1037680
        [content_type] =>
        [http_code] => 404
        [header_size] => 0
        [request_size] => 250
        [filetime] => -1
        [ssl_verify_result] => 0
        [redirect_count] => 0
        [total_time] => 0.132769
        [namelookup_time] => 0.009406
        [connect_time] => 0.035694
        [pretransfer_time] => 0.090879
        [size_upload] => 0
        [size_download] => 0
        [speed_download] => 0
        [speed_upload] => 0
        [download_content_length] => -1
        [upload_content_length] => -1
        [starttransfer_time] => 0.132714
        [redirect_time] => 0
        [redirect_url] =>
        [primary_ip] => 208.66.171.71
        [certinfo] => Array
                (
                )

        [primary_port] => 443
        [local_ip] => 16.1.1.151
        [local_port] => 55977
        [http_version] => 2
        [protocol] => 2
        [ssl_verifyresult] => 0
        [scheme] => HTTPS
        [appconnect_time_us] => 90757
        [connect_time_us] => 35694
        [namelookup_time_us] => 9406
        [pretransfer_time_us] => 90879
        [redirect_time_us] => 0
        [starttransfer_time_us] => 132714
        [total_time_us] => 132769
)

File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 50, data = ''
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 53, 'https://americandrivingsociety.org/docs.ashx?id=1037680' gives bad status code 404 when accessed, with message 'The requested URL returned error: 404 Not Found'
C:\xampp1826\htdocs\OSH0\curltest.php:77:
bool(false)

repete@DESKTOP-CLQS7C1 /cygdrive/c/xampp1826/htdocs/OSH0
$

Here's the same thing using the s3 URL it redirects to:

$ php curltest.php 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D'
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 21, $value = 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D', $matches = Array
(
        [0] => https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D
        [1] => https://
)
*   Trying 52.216.56.0:443...
* Connected to s3.amazonaws.com (52.216.56.0) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: \xampp7412\apache\bin\curl-ca-bundle.crt
    CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=s3.amazonaws.com
*  start date: Apr 11 00:00:00 2023 GMT
*  expire date: Dec 20 23:59:59 2023 GMT
*  subjectAltName: host "s3.amazonaws.com" matched cert's "s3.amazonaws.com"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
> HEAD /ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D HTTP/1.1
Host: s3.amazonaws.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
Accept: */*
HTTP_X_REQUESTED_WITH: XMLHttpRequest

* Mark bundle as not supporting multiuse
* The requested URL returned error: 403 Forbidden
* Closing connection 0
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 46, $allinfo = Array
(
        [url] => https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D
        [content_type] =>
        [http_code] => 403
        [header_size] => 0
        [request_size] => 523
        [filetime] => -1
        [ssl_verify_result] => 0
        [redirect_count] => 0
        [total_time] => 0.128771
        [namelookup_time] => 0.027331
        [connect_time] => 0.043198
        [pretransfer_time] => 0.107906
        [size_upload] => 0
        [size_download] => 0
        [speed_download] => 0
        [speed_upload] => 0
        [download_content_length] => -1
        [upload_content_length] => -1
        [starttransfer_time] => 0.128721
        [redirect_time] => 0
        [redirect_url] =>
        [primary_ip] => 52.216.56.0
        [certinfo] => Array
                (
                )

        [primary_port] => 443
        [local_ip] => 16.1.1.151
        [local_port] => 56277
        [http_version] => 2
        [protocol] => 2
        [ssl_verifyresult] => 0
        [scheme] => HTTPS
        [appconnect_time_us] => 107740
        [connect_time_us] => 43198
        [namelookup_time_us] => 27331
        [pretransfer_time_us] => 107906
        [redirect_time_us] => 0
        [starttransfer_time_us] => 128721
        [total_time_us] => 128771
)

File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 50, data = ''
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 53, 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D' gives bad status code 403 when accessed, with message 'The requested URL returned error: 403 Forbidden'
C:\xampp1826\htdocs\OSH0\curltest.php:77:
bool(false)

repete@DESKTOP-CLQS7C1 /cygdrive/c/xampp1826/htdocs/OSH0
$
sootsnoot

1 Answer


I added a check to accept as valid anything that gave a message with "SSL" in it

This seems dangerous. What if the error message is "Invalid SSL certificate"?

since that would suggest that the URL reached a webserver

This is true of any response -- 300, 400, 500, whatever. If your connection didn't time out, then you've successfully connected to something, regardless of the status code. I.e., by this logic, if "reaching a webserver" is what you're validating, then only a timeout should fail.
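
As a minimal sketch of that idea (a hypothetical helper, not part of the posted validator): any HTTP status at all means something answered, and only connection-level failures (DNS failure, refused connection, timeout) leave the status at zero:

function serverAnswered(string $url): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // a HEAD is enough for this check
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // stays 0 if no response was received
    curl_close($ch);
    return $status !== 0;
}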

I suppose I could delete the CURLOPT_NOBODY to have it send a GET request, but that seems silly since I don't care about the body.

You can't expect that every URL will be successfully reachable via a HEAD request, or that the results of HEAD request will always be the same as the results of a GET request.

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

Don't do this. If verification fails, you want the request to fail; that's the whole point of SSL.

Overall, if you're not going to validate the actual content of the page, then I don't think it makes any sense to even make the request. Just validate the syntax of the URL. Otherwise, you're going to fail on things like transient network errors, maintenance downtimes, ad blockers, IP-based filtering, etc. You've got acres of code for what should just be one line:

class Oshk_ZendX_Validate_Url {
    public static function isValid(string $url): bool
    {
        return (bool) filter_var($url, FILTER_VALIDATE_URL);
    }
}

If you want to also test the connection and make sure there's a live server answering the request at the time of form submission, then the status doesn't really matter, and you can just check for a non-false return value from the HTTP wrapper via file_get_contents():

class Oshk_ZendX_Validate_Url {
    public static function isValid(string $url): bool
    {
        return filter_var($url, FILTER_VALIDATE_URL) &&
            file_get_contents($url) !== false;
    }
}
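
One caveat worth noting (a hypothetical variant, not required by the above): file_get_contents() emits a warning on failure and waits for the default socket timeout on an unresponsive host, so you may want to suppress the warning and pass an explicit timeout via a stream context:

class Oshk_ZendX_Validate_Url {
    public static function isValid(string $url): bool
    {
        // Give up after 5 seconds instead of the default socket timeout,
        // and silence the warning file_get_contents() emits on failure.
        $context = stream_context_create(['http' => ['timeout' => 5]]);
        return filter_var($url, FILTER_VALIDATE_URL) &&
            @file_get_contents($url, false, $context) !== false;
    }
}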
Alex Howansky
  • Your assertion that any response code indicates reaching something is blatantly useless for this purpose. For example, DNS lookup failures are a likely problem for user-typed URLs that should definitely be detected before being stored in a database to be displayed to other users. That's the whole point of doing something more than lexical analysis. Even if the failure is invalid SSL certificate, that's really more of a transient problem that the owner of the server is *likely* to fix. – sootsnoot May 09 '23 at 16:33
  • If you have a DNS lookup failure, then you won't get a response code. So... getting a response code means there was no DNS lookup failure. If you want to actually test the connection and make sure there's a live server answering the request at the time of form submission, then just check for a non-false `file_get_contents($url)` -- there's no need for all this fake AJAX and status wrangling. – Alex Howansky May 09 '23 at 16:41
  • On the other hand, 404 and 403 are likely user errors - though a change on the server could correct them, it would be totally by accident. Enabling VERIFYPEER just creates a maintenance issue for the validator, in that I would have to keep CA certificates/bundles updated. And it's not super helpful in eliminating URLs to be saved in the database and presented to another user, because a URL with an invalid certificate would be detected by the user's browser, letting them decide whether or not they really want to visit the site. – sootsnoot May 09 '23 at 16:42
  • I appreciate that it's too much code for the job at hand, but most of it is debugging triggered by the mysterious unknown SSL protocol error. I don't see any fake ajax. If I can't use HEAD (since it doesn't work for URLs of practical interest to my user base), then I guess I could try file_get_contents(). Some posts here say that it's significantly slower than curl, but I don't think that's a big deal for validating a couple of fields on a form. But I'll need to check it with these particular URLs. Thanks. – sootsnoot May 09 '23 at 16:48
  • BTW, for server blurfo.blurfo, this code returns 'https://blurfo.blurfo' gives bad status code 0 when accessed, with message 'Could not resolve host: blurfo.blurfo'. That's what I meant by DNS lookup failure. – sootsnoot May 09 '23 at 16:51
  • _"That's what I meant by DNS lookup failure."_ Sure, that's expected. But note that the status code is provided by the remote server, so _any_ status code at all indicates that a remote server was successfully contacted. I.e., if all you want to know is, "did a web server answer this request," then you can do that simply by checking for a non-zero status. (And turning off CURLOPT_FOLLOWLOCATION so you don't make extra requests.) – Alex Howansky May 09 '23 at 17:05
  • _"I don't see any fake ajax."_ This was a reference to your use of the `HTTP_X_REQUESTED_WITH: XMLHttpRequest` header, which emulates an AJAX request in a non-AJAX context. – Alex Howansky May 09 '23 at 17:07
  • Yeah, that was a leftover from experimenting with all kinds of CURLOPT values. BTW, looks like you didn't thoroughly test your suggestion. Using file_get_contents(), the original URL, returns true, but the s3 URL to which it redirects, still fails. Though for my purposes, that's good enough, users filling out the form are not going to enter the s3 URL. – sootsnoot May 09 '23 at 17:21
  • The CURLOPT_FOLLOWLOCATION is essential, since redirection to a non-existent resource needs to be detected. And of course file_get_contents() also follows redirects. – sootsnoot May 09 '23 at 17:23
  • Some further testing with URLs of interest suggests that file_get_contents() does the job very nicely. If you want to post that as an answer, I'll accept it. – sootsnoot May 09 '23 at 17:34
  • (Updated answer text.) – Alex Howansky May 09 '23 at 17:41
  • I accepted your answer, but note that it isn't really a replacement for the code I posted, since filter_var doesn't provide the substring matching of the original preg_match that accepts strings like 'google.com' and puts an 'https://' prefix on them. Actually I changed it to prefix with 'http://', expecting that any decent server will redirect to https if they support it. – sootsnoot May 09 '23 at 17:47
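
A rough sketch of the scheme-prefixing mentioned in that last comment (hypothetical, not part of the accepted answer): FILTER_VALIDATE_URL rejects bare hosts like 'google.com', so a scheme has to be prepended before validating:

function normalizeUrl(string $url): string
{
    // Prepend a scheme if the user typed a bare host; FILTER_VALIDATE_URL requires one.
    return preg_match('#^https?://#i', $url) ? $url : "http://$url";
}

var_dump(filter_var(normalizeUrl('google.com'), FILTER_VALIDATE_URL)); // string(17) "http://google.com"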