2

Let's say I want to load XML files from a remote server only if they are up to 10MB in size.

Something like

$xml_file = "http://example.com/largeXML.xml";// size= 500MB

//PRACTICAL EXAMPLE: $xml_file = "http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml";// size= 683MB

/* GOAL: prevent this large file from being loaded by DOMDocument, without having to download the whole file first just to check its size */

$dom =  new DOMDocument();

$dom->load($xml_file /*LOAD only IF the file_size is <= 10MB....else...echo 'File is too large'*/);

How can this be achieved? Any ideas, alternatives, or best approaches would be highly appreciated.

I checked "PHP: Remote file size without downloading file", but when I try something like

var_dump(
    curl_get_file_size(
        "http://www.dailymotion.com/rss/user/dialhainaut/"
    )
);

I get string 'unknown' (length=7)

When I try with get_headers as suggested below, the Content-Length is missing in the headers, so this will not work reliably either.

Please kindly advise how to determine the length and avoid sending it to the DOMDocument if it exceeds 10MB

ErickBest
  • Did you look at [filesize()](http://php.net/manual/en/function.filesize.php) function? – Mawia HL Apr 21 '16 at 06:38
  • @MawiaHL Can you try: `var_dump(filesize("http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml"))` – ErickBest Apr 21 '16 at 06:44
  • Page not found is the result. – Mawia HL Apr 21 '16 at 06:48
  • @MawiaHL -- This loads in the browser: https://www.w3.org/TR/2001/REC-xsl-20011015/xslspec.xml .... but doesn't work with `filesize()`..... in `var_dump(filesize('https://www.w3.org/TR/2001/REC-xsl-20011015/xslspec.xml'))` – ErickBest Apr 21 '16 at 06:50
  • @DownVoters.... Please advise what is wrong with the Question. Thank You! – ErickBest Apr 21 '16 at 08:04

3 Answers

2

Ok, finally working. The headers solution was obviously not going to work broadly. In this solution, we open a file handle and read the XML line by line until it hits the threshold of $max_B. If the file is too big, we still have the overhead of reading it up until the 10MB mark, but it's working as expected. If the file is less than $max_B, it proceeds...

$xml_file = "http://www.dailymotion.com/rss/user/dialhainaut/";
//$xml_file = "http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml";

$fh = fopen($xml_file, "r");

if($fh){
    $file_string = '';
    $total_B = 0;
    $max_B = 10485760; //10MB
    //run through lines of the file, concatenating them into a string
    while (!feof($fh)){
        //compare against false explicitly so a line containing only "0" is not skipped
        if(($line = fgets($fh)) !== false){
            $total_B += strlen($line);
            if($total_B <= $max_B){
                $file_string .= $line;
            } else {
                break; //over the limit: stop reading
            }
        }
    }

    fclose($fh);

    if($total_B <= $max_B){
        echo 'File ok. Total size = '.$total_B.' bytes. Proceeding...';
        //proceed
        $dom = new DOMDocument();
        $dom->loadXML($file_string); //NOTE the method change because we're loading from a string

    } else {
        //reject
        echo 'File too big! Max size = '.$max_B.' bytes.';
    }

} else {
    echo 'Could not open the file (not found or unreachable)!';
}
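One caveat with reading line by line: `fgets()` without a length argument keeps reading until the next newline, so an XML file with no line breaks would still arrive as one huge line. A minimal sketch of the same idea using fixed-size `fread()` chunks follows. The helper name `read_capped` and the 8192-byte chunk size are illustrative assumptions, not part of the code above.

```php
<?php
/**
 * Read at most $maxBytes from an already opened stream.
 * Returns the data, or null if the stream holds more than $maxBytes.
 * (read_capped is a hypothetical helper name used for illustration.)
 */
function read_capped($fh, int $maxBytes, int $chunkSize = 8192): ?string
{
    $data = '';
    while (!feof($fh)) {
        $chunk = fread($fh, $chunkSize);
        if ($chunk === false) {
            break; // read error: stop and return what was read so far
        }
        $data .= $chunk;
        if (strlen($data) > $maxBytes) {
            return null; // over the limit: give up without reading the rest
        }
    }
    return $data;
}

// Usage sketch with the URL from the question:
// $fh  = fopen('http://www.dailymotion.com/rss/user/dialhainaut/', 'r');
// $xml = $fh ? read_capped($fh, 10 * 1024 * 1024) : null;
// if ($xml !== null) { $dom = new DOMDocument(); $dom->loadXML($xml); }
```

With this variant, at most one chunk beyond the limit is ever held in memory, regardless of how the XML is broken into lines.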
larsAnders
  • This crashes when tested with `file_get_contents("http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml");` (size: 683MB)... Kindly advise – ErickBest Apr 21 '16 at 07:11
  • The script hung and I had to restart the server... `file_get_contents` tries to load the entire `683 MB` into memory before it is worked on – ErickBest Apr 21 '16 at 07:13
  • Yeah, we're banging up against the maximum size for a single variable; it's the `memory_limit` setting in php.ini. We need a better solution, one that can test the file without loading the whole thing. – larsAnders Apr 21 '16 at 07:16
  • Nice try! However, these files are random, from random servers. This fails on something like: `$xml_file = "http://www.dailymotion.com/rss/user/dialhainaut/";` `$head = array_change_key_case(get_headers($xml_file, TRUE));` The `headers` do not include `Content-Length`... please see **@Mawia HL**'s answer. Thanks at least for trying :) – ErickBest Apr 21 '16 at 07:31
  • Ok, I've added some simple error handling. You can update precisely how errors are handled as you like. But this does answer the question - if the file exists, and it's too big, you can now determine that without downloading it. – larsAnders Apr 21 '16 at 07:40
  • That's really great! I'll give it a thought. Though I'm just wondering: `dailymotion.com` **users** intending to load their `xml` contents will always get the `ERROR`. A link like `dailymotion.com/rss/user/dialhainaut/` shows the `XML` in the web browser, but the `headers` don't include `'Content-length'`... but thanks a lot – ErickBest Apr 21 '16 at 07:45
  • `get_headers()` does a full GET request, so you need to provide a stream context and change the method to HEAD to prevent downloading the thing just to get its size. See example #2 in the PHP manual. – Gordon Apr 21 '16 at 08:04
  • Ok, finally working for daily motion, and also rejecting the oversize files. Check it out. – larsAnders Apr 21 '16 at 08:39
  • @larsAnders So far your suggestion has worked perfectly in multiple tests. You deserve **"un coup de chapeau"**, unless a different solution is suggested... but this works very well. Merci beaucoup! – ErickBest Apr 21 '16 at 09:07
  • That's great! Thanks for the challenge, for sure the most interesting question of the day. – larsAnders Apr 21 '16 at 09:09
  • @larsAnders you're awesome, buddy... if you like the question, please upvote it. Thanks again! – ErickBest Apr 21 '16 at 09:12
  • I would *still* check whether the headers are there, because if they are, it'll save us the trouble of potentially downloading 10MB of garbage to decide that it's garbage. – Madara's Ghost Apr 21 '16 at 09:20
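The two suggestions in the comments above (send a HEAD request through a stream context, and check the headers before streaming anything) could be combined roughly as follows. This is a sketch under assumptions: the helper names `content_length_from_headers` and `remote_declared_size` and the 10-second timeout are made up for illustration, and the third (context) argument of `get_headers()` needs PHP 7.1 or later.

```php
<?php
/**
 * Extract Content-Length from the array returned by get_headers($url, true).
 * Returns null when the header is absent (for example, chunked responses).
 * (Hypothetical helper name, used for illustration only.)
 */
function content_length_from_headers(array $headers): ?int
{
    foreach ($headers as $name => $value) {
        // Header names are case-insensitive; index 0 holds the status line.
        if (is_string($name) && strcasecmp($name, 'Content-Length') === 0) {
            // After redirects the value is an array; the last entry is
            // assumed to belong to the final response.
            $last = is_array($value) ? end($value) : $value;
            return is_numeric($last) ? (int) $last : null;
        }
    }
    return null;
}

/**
 * Ask the server for headers only (HEAD) and report the declared size, if any.
 */
function remote_declared_size(string $url): ?int
{
    $context = stream_context_create([
        'http' => ['method' => 'HEAD', 'timeout' => 10],
    ]);
    $headers = @get_headers($url, true, $context);
    return $headers === false ? null : content_length_from_headers($headers);
}

// Usage sketch: reject early when the server declares a size over the limit,
// and fall back to a capped streaming read when the result is null.
// $size = remote_declared_size('http://example.com/largeXML.xml');
// if ($size !== null && $size > 10485760) { echo 'File is too large'; }
```

A `null` result only means the size is unknown, as with the dailymotion feed discussed here, so it cannot replace the streaming check; it just avoids the download when the server is cooperative.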
1

10MB equals 10485760 bytes. If the Content-Length header is not present, the function below falls back to cURL (available since PHP 5), downloads the body, and reports the number of bytes received. I found this code somewhere on SO but cannot remember where:

function get_filesize($url) {
    $headers = get_headers($url, 1);
    if (isset($headers['Content-Length'])) return $headers['Content-Length'];
    if (isset($headers['Content-length'])) return $headers['Content-length'];
    //no Content-Length header: download the body with curl and count the bytes received
    $c = curl_init();
    curl_setopt_array($c, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        //the header must be a single line, otherwise the request is malformed
        CURLOPT_HTTPHEADER => array('User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3'),
    ));
    curl_exec($c);
    return curl_getinfo($c, CURLINFO_SIZE_DOWNLOAD);
}

$filesize = get_filesize("http://www.dailymotion.com/rss/user/dialhainaut/");
if($filesize<=10485760){
    echo 'Fine';
}else{
    echo $filesize.' bytes: File is too big';
}

Check demo here

Mawia HL
  • @Mawia HL We have tried that before; it fails when used on this XML: `$head = array_change_key_case(get_headers("http://www.dailymotion.com/rss/user/dialhainaut/", TRUE));` The headers do not include `Content-Length`... please try and advise. Thanks – ErickBest Apr 21 '16 at 07:07
  • @ErickBest, http://www.dailymotion.com/rss/user/dialhainaut/ does not return anything. It only returns `Page not found The page you're looking for is either restricted or doesn't exist`. So how can anyone know the size of the file when it does not exist at all? – Mawia HL Apr 21 '16 at 07:28
  • --Please try this in the Web-Browser: http://www.dailymotion.com/rss/user/dialhainaut/ – ErickBest Apr 21 '16 at 07:35
  • If there is no content at all, get_headers won't return anything. – Mawia HL Apr 21 '16 at 07:36
  • Sure... makes sense. but when run in the browser the XML shows up... if it is larger than `10MB` it should not be loaded to the `DOMDocument` – ErickBest Apr 21 '16 at 07:40
  • `get_headers()` does a full GET request, so you need to provide a stream context and change the method to HEAD to prevent downloading the thing just to get its size. See example #2 in the PHP manual. – Gordon Apr 21 '16 at 08:02
  • @Mawia HL Yes, **confirmed**: working perfectly. _(Wish I could have 2 accepted answers)_ – ErickBest Apr 21 '16 at 09:28
  • @Mawia HL The `code` in the demo works in multiple tests... but not the code in your answer. Can you paste the `demo code` as your answer? – ErickBest Apr 21 '16 at 09:34
  • @ErickBest, the advantage of using my answer over the accepted one is that some hosting providers disable the fopen() function. On a Windows web server, when using fopen with a file path stored in a variable, PHP will return an error if the variable isn't encoded in ASCII. When using SSL, Microsoft IIS will violate the protocol by closing the connection without sending a close_notify indicator. So cURL is better. – Mawia HL Apr 21 '16 at 12:47
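If cURL is preferred over `fopen()`, the download itself can also be capped so that an oversized body is never fully transferred. The following is a sketch, not the code from this answer: the function name `curl_fetch_capped` and the 30-second timeout are assumptions chosen for illustration. It relies on the documented libcurl behavior that a write callback returning a value different from the chunk length aborts the transfer.

```php
<?php
/**
 * Download $url with cURL, aborting as soon as more than $maxBytes arrive.
 * Returns the body, or null if the limit was exceeded or the transfer failed.
 * (curl_fetch_capped is a hypothetical helper name.)
 */
function curl_fetch_capped(string $url, int $maxBytes): ?string
{
    $body = '';
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
        // Called for every chunk received. Returning a number other than
        // strlen($chunk) makes cURL abort the transfer with a write error.
        CURLOPT_WRITEFUNCTION  => function ($ch, string $chunk) use (&$body, $maxBytes): int {
            if (strlen($body) + strlen($chunk) > $maxBytes) {
                return 0; // over the limit: abort
            }
            $body .= $chunk;
            return strlen($chunk);
        },
    ]);
    $ok = curl_exec($ch); // true on success, false on abort or error
    return $ok === false ? null : $body;
}

// Usage sketch:
// $xml = curl_fetch_capped('http://www.dailymotion.com/rss/user/dialhainaut/', 10485760);
// if ($xml === null) { echo 'File is too big or could not be fetched'; }
```

Unlike reading `CURLINFO_SIZE_DOWNLOAD` after a full `curl_exec()`, this stops after roughly `$maxBytes` have been received, so a 683MB file costs at most about 10MB of transfer.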
-1

Edit: New answer, a bit of a workaround:
You can't check the DOM element's length, BUT you can make a HEAD request and read the file size from the response headers:

<?php

function i_hope_this_works( $XmlUrl ) {
    // Assume failure until proven otherwise: -1 means the request failed
    $result = -1;

    $request = curl_init( $XmlUrl );

    // Send a HEAD request, so a 1 GB file takes no longer than a 1 kB one
    curl_setopt( $request, CURLOPT_NOBODY, true );
    curl_setopt( $request, CURLOPT_HEADER, true );
    curl_setopt( $request, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $request, CURLOPT_FOLLOWLOCATION, true );
    // Any user-agent string will do; some servers reject requests without one
    curl_setopt( $request, CURLOPT_USERAGENT, 'Mozilla/5.0' );

    $requesteddata = curl_exec( $request );
    curl_close( $request );

    if( $requesteddata ) {
        $content_length = "unknown";
        $status = "unknown";

        // Accept any HTTP version in the status line (HTTP/1.0, HTTP/1.1, HTTP/2)
        if( preg_match( "/^HTTP\/[\d.]+ (\d\d\d)/", $requesteddata, $matches ) ) {
            $status = (int)$matches[1];
        }

        // Header names are case-insensitive
        if( preg_match( "/Content-Length: (\d+)/i", $requesteddata, $matches ) ) {
            $content_length = (int)$matches[1];
        }

        // 200 is OK; 301 to 308 are redirects (see any list of HTTP status codes)
        if( $status == 200 || ($status > 300 && $status <= 308) ) {
            $result = $content_length;
        }
    }

    return $result;
}
    ?>

You should now be able to get the file size for any URL whose server sends a Content-Length header, just with

$file_size = i_hope_this_works('yourURLasString');
  • RESULT: `Warning: Illegal string offset 'size' in C:\.....\fileSize_tst\index.php on line 5` – ErickBest Apr 21 '16 at 06:53
  • What is the value of size ? – Johannes Geidel Apr 21 '16 at 06:53
  • The size is unknown... it can be any size, but must not be `> 10MB`... the file comes from a remote server (please read the question once more) – ErickBest Apr 21 '16 at 06:55
  • I mean what is in the variable size – Johannes Geidel Apr 21 '16 at 06:56
  • @johannes You added the variable `size` in your answer, and I said that there is no way the variable `size` can be used because the file is loaded from a remote server with unknown `size`... – ErickBest Apr 21 '16 at 07:03
  • Actually, size will be generated by PHP for you, but I don't know if it works with a remote server since I don't use it a lot. I will edit my answer to have more specific code... that at least works for me – Johannes Geidel Apr 21 '16 at 07:09
  • @johannes The file is not being `uploaded`; it is already stored on a different server. We simply want to read it using the `DOMDocument`, but we want to **STOP reading**, or **NOT read at all**, if its size exceeds `10MB`. Files like `http://www.cs.washington.edu/research/xmldatasets/data/pir/psd7003.xml` or `http://www.cs.washington.edu/research/xmldatasets/data/SwissProt/SwissProt.xml` are just too large – ErickBest Apr 21 '16 at 07:19
  • Try `int filesize ( string $filename )`, where filename is the path to the file. From which server are you getting the files? – Johannes Geidel Apr 21 '16 at 07:34
  • From multiple servers... the servers are extremely random, but the main GOAL is that we do whatever we can so that `NO FILE` with a `size` beyond `10MB` is passed to the `DOMDocument` – ErickBest Apr 21 '16 at 07:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109773/discussion-between-johannes-geidel-and-erickbest). – Johannes Geidel Apr 21 '16 at 08:47
  • This approach failed ... returns `unknown`... when tested with `http://www.dailymotion.com/rss/user/dialhainaut/` – ErickBest Apr 21 '16 at 09:45