5

I have a very large file (about 20 GB). How can I use fseek() to jump around and read its content?

The code looks like this:

function read_bytes($f, $offset, $length) {
    fseek($f, $offset);
    return fread($f, $length);
}

The result is only correct if $offset < 2147483647.

Update: I am running on 64-bit Windows; phpinfo() reports Architecture: x64 and PHP_INT_MAX: 2147483647.
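
For reference, this is enough to confirm that the limit comes from PHP's integer size rather than from the OS (a minimal sketch using only standard PHP constants and functions):

// PHP_INT_SIZE is 8 and PHP_INT_MAX is 9223372036854775807 on a build with 64-bit
// integers; a PHP_INT_MAX of 2147483647 means PHP's native int is only 32 bits wide,
// which is what limits the fseek() offset here.
var_dump(PHP_INT_SIZE, PHP_INT_MAX);
echo php_uname('m'), ' / PHP ', PHP_VERSION;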

anvoz

2 Answers

4

WARNING: as noted in the comments, fseek uses an INT internally, so it simply can't work with such large files on 32-bit PHP builds. The following solution won't work there; it is left here just for reference.

A little bit of searching led me to the comments on the PHP manual page for fseek():

http://php.net/manual/en/function.fseek.php

The problem is the maximum int size of the offset parameter, but it seems you can work around it by doing multiple fseek() calls with the SEEK_CUR option and mixing in a big-number library such as GMP.

Example:

// requires the GMP extension; $offset is passed as a numeric string
function fseek64(&$fh, $offset)
{
    // rewind, then advance in steps no larger than PHP_INT_MAX
    fseek($fh, 0, SEEK_SET);
    $t_offset = '' . PHP_INT_MAX;
    while (gmp_cmp($offset, $t_offset) == 1)
    {
        // subtract one full step from the remaining offset and seek forward by it
        $offset = gmp_sub($offset, $t_offset);
        fseek($fh, gmp_intval($t_offset), SEEK_CUR);
    }
    // the remainder now fits in a native int
    return fseek($fh, gmp_intval($offset), SEEK_CUR);
}

fseek64($f, '23456781232');
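
To mirror read_bytes() from the question, the helper would be used roughly like this (an untested sketch; the file path is a placeholder, and the 32-bit warning above still applies):

function read_bytes64($f, $offset, $length) {
    // seek with the GMP-assisted helper instead of a single fseek() call
    fseek64($f, '' . $offset);
    return fread($f, $length);
}

$f = fopen('C:/path/to/bigfile.dat', 'rb'); // placeholder path
echo read_bytes64($f, '23456781232', 100);
fclose($f);
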
fsw
  • Hmm, but if you are working on a 64-bit system with 64-bit PHP, maybe the problem lies elsewhere. – fsw Jun 14 '13 at 18:28
  • This is also downvoted in the comments on php.net, and I think it might be because of this line: `$offset = $offset - $t_offset;`. $offset has to be cast to int to resolve the right-hand side of the assignment, which cannot be above PHP_INT_MAX. – Orangepill Jun 14 '13 at 18:40
  • I tried on the 20GB file: `fseek64($f, 2200000000); // 2,200,000,000` `echo fread($f, 100);` Then I tried on a smaller part of that file (the second 2-billion-byte chunk): `fseek($f, 200000000); // 200,000,000` `echo fread($f, 100);` The two results are different. – anvoz Jun 14 '13 at 18:56
  • As @Orangepill correctly pointed out, this solution will not work on 32-bit PHP, but the concept stands. I think you can try multiple SEEK_CUR calls, but you will have to mix the above solution with one of the libraries for handling big numbers: http://stackoverflow.com/questions/211345/working-with-large-numbers-in-php – fsw Jun 14 '13 at 19:00
  • I changed my answer to something that "should work"; please test. – fsw Jun 14 '13 at 19:08
  • @fsw thanks, but it still doesn't work correctly. I tried it, but it returned the same result (like "xyz") for both $offset = 3 billion and $offset = 4 billion. The result for $offset < PHP_INT_MAX is correct. – anvoz Jun 14 '13 at 19:24
  • As I mentioned, I haven't run it; this is just a concept. I am not so desperate as to generate a 20GB file and test it ATM :) Maybe the fseek implementation itself is the problem. You will have to debug it on your own, adding some prints to see whether you are indeed calling fseek several times. – fsw Jun 14 '13 at 19:37
  • Not sure whether I could actually jump beyond 2147483647 or not. `ftell()` always reports 2147483647, and `fread()` always reads as if the offset were 2147483647. Maybe the position of the file pointer cannot be greater than 2147483647. – anvoz Jun 15 '13 at 04:05
  • I tried this today; it doesn't work on 32-bit systems. That's because internally fseek uses an INT for storing the current file pointer, and an INT can't go beyond 2 GB. – Tech Consultant Jul 09 '13 at 15:31
  • Thanks for testing this out, so it is just what I was worried about; I added this to the answer. – fsw Jul 09 '13 at 15:47
3

For my project, I needed to READ blocks of 10KB from a BIG offset in a BIG file (>3 GB). Writes were always appends, so no offsets were needed.

This will work irrespective of which PHP version and OS you are using.

Prerequisite: your server should support range-retrieval (byte-range) requests. Apache and IIS already support this, as do 99% of other web servers (shared hosting or otherwise).

// offset, 3GB+
$start = floatval(3355902253);

// bytes to read, 100 KB
$len = floatval(100 * 1024);

// set up the http byte-range header: "Range: bytes=start-end" (end is inclusive)
$opts = array('http' => array('method' => 'GET', 'header' => "Range: bytes=$start-" . ($start + $len - 1)));
$context = stream_context_create($opts);
// show the byte-range header being sent (debugging aid)
print_r($opts);

// change the URL below to the URL of your file. DO NOT change it to a file path.
// you MUST use an http:// URL for your file for the http request to work.
// this will output the results
echo $result = file_get_contents('http://127.0.0.1/dir/mydbfile.dat', false, $context);

// status of your request
// if this is empty, it means the http request didn't fire.
print_r($http_response_header);

// Check your file URL by opening it directly in a web browser. If the http response
// shows an error (i.e. a code >= 400), check that you are sending the correct Range
// header bytes. For example, if you give a start range which exceeds the current
// file size, the server will respond with 416 (Requested Range Not Satisfiable).

// NOTE - The current file size is also returned in the http response header, e.g.
// Content-Range: bytes 355902253-355903252/355904253; the last number is the total file size.
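
If you need that total size in PHP, a minimal sketch for pulling it out of $http_response_header could look like this (it assumes the server actually sent a Content-Range header):

// extract the total size from a "Content-Range: bytes X-Y/SIZE" header (sketch)
$fileSize = null;
foreach ($http_response_header as $header) {
    if (preg_match('#^Content-Range:\s*bytes\s+\d+-\d+/(\d+)#i', $header, $m)) {
        $fileSize = $m[1]; // keep it as a string; it can exceed PHP_INT_MAX on 32-bit builds
        break;
    }
}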

...

...

...

SECURITY: you must add an .htaccess rule which denies all requests for this database file except those coming from the local IP 127.0.0.1; see the sketch below.
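
A minimal sketch of such a rule, assuming Apache 2.2-style access control and the mydbfile.dat file name from the example above (adjust the syntax for your Apache version):

# .htaccess sketch: deny everything except requests from localhost
<Files "mydbfile.dat">
    Order Deny,Allow
    Deny from all
    Allow from 127.0.0.1
</Files>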

thecoshman
Tech Consultant
  • I tried this solution but the memory got exhausted. Does using the `offset/maxlen parameters` or the `Range header` help to prevent `file_get_contents` from reading the entire file into memory? – anvoz Jul 10 '13 at 03:06
  • HTTP range header queries retrieve only the amount you requested, in this case 100KB, so there is no memory issue there; the problem is with the PHP script. If you are reading 10MB, make sure you set the PHP memory limit to double that amount: ini_set('memory_limit', '20M'); – Tech Consultant Jul 10 '13 at 06:58
  • I only read 200 bytes (`$len = floatval(200);`) of a 20GB file. My `memory_limit` is 1024M (I also tried a larger value to test). The request takes a few minutes to load the file into memory and then breaks with a memory-exhausted error. – anvoz Jul 10 '13 at 08:52
  • "Request takes a few minutes to load" - that's shocking. I did this on my 3.5 GB file and it took a few ms, irrespective of whether I gave a big offset or a small offset. Can you post the output of print_r($http_response_header)? It will show where the problem is with the http range request. – Tech Consultant Jul 10 '13 at 15:08
  • BTW, I am on 32-bit Windows XP, using XAMPP (Apache + PHP 5.3), and it works perfectly with the http range request. Please share the response from this code: print_r($http_response_header); – Tech Consultant Jul 10 '13 at 15:12
  • Sorry, I used `mydbfile.dat` instead of `http://127.0.0.1/dir/mydbfile.dat`; that's why it didn't work and `$http_response_header` was undefined (`$http_response_header` ONLY gets populated by file_get_contents() when using a URL and NOT a local file). Changed to a URL and your solution works like a charm. – anvoz Jul 11 '13 at 02:08