6

I have a 1.2GB file that contains a one line string. What I need is to search the entire file to find the position of an another string (currently I have a list of strings to search). The way what I'm doing it now is opening the big file and move a pointer throught 4Kb blocks, then moving the pointer X positions back in the file and get 4Kb more.

My problem is that a bigger string to search, a bigger time he take to got it.

Can you give me some ideas to optimize the script to get better search times?

this is my implementation:

function busca($inici){
        $limit = 4096;

        $big_one    = fopen('big_one.txt','r');
        $options    = fopen('options.txt','r');

        while(!feof($options)){
            $search = trim(fgets($options));
            $retro  = strlen($search);//maybe setting this position absolute? (like 12 or 15)

            $punter = 0;
            while(!feof($big_one)){
                $ara = fgets($big_one,$limit);

                $pos = strpos($ara,$search);
                $ok_pos = $pos + $punter;

                if($pos !== false){
                    echo "$pos - $punter - $search : $ok_pos <br>";
                    break;
                }

                $punter += $limit - $retro;
                fseek($big_one,$punter);
            }
            fseek($big_one,0);
        }
    }

Thanks in advance!

Marc
  • 289
  • 1
  • 3
  • 8
  • What happens when you use the plain strpos() function on the 1.2GB file? – powtac Jun 09 '10 at 22:27
  • I found a benchmark for different matching methods in PHP, but strpos() is the fastest. http://www.hashbangcode.com/blog/fastest-way-match-string-php-200.html – powtac Jun 09 '10 at 22:27
  • How big is options.txt and what does it look like? – 0scar Jun 09 '10 at 23:32
  • powtac, I don't open the 1.2GB at once, I open it by parts and it only waste 32MB of RAM (aprox). 0scar, the options.txt has 25.000.000 milion lines (and options to be 50M). – Marc Jun 10 '10 at 00:10
  • powtac, I had read some similar benchmarks about strpos, is for that, that I'm searching for an optimization in all the parts of the script. – Marc Jun 10 '10 at 00:11
  • to put your 1.2Gb into a database is not an option by some whim, I suppose – Your Common Sense Jun 10 '10 at 01:02
  • One problem I see with this is if your search string ($search) is split across two 4k blocks? Also, why don't you load all the strings in options.txt first, this saves you from scanning the 1.2GB file over and over. – bramp Jun 10 '10 at 01:27
  • How many lines are in options.txt, and what kind of values are on each line? I've just been reading up on multiple pattern search algorithms such as Rabin–Karp or Aho–Corasick string search. I'd quite like to try implementing one in PHP and compare it to your current solution. – bramp Jun 10 '10 at 01:45

2 Answers2

11

Why don't use exec + grep -b?

exec('grep "new" ext-all-debug.js -b', $result);
// here we have looked for "new" substring entries in the extjs debug src file
var_dump($result);

sample result:

array(1142) {
    [0]=>  string(97) "3398: * insert new elements. Revisiting the example above, we could utilize templating this time:"
    [1]=>  string(54) "3910:var tpl = new Ext.DomHelper.createTemplate(html);"
    ...
}

Each item consists of string offset in bytes from the start of file and the line itself, separated with colon.
So after this you have to look inside the particular line and append the position to the line offset. I.e.:

[0]=>  string(97) "3398: * insert new elements. Revisiting the example above, we could utilize templating this time:"

this means that "new" occurrence found at 3408th byte (3398 is the line position and 10 is the position of "new" inside this line)

zerkms
  • 249,484
  • 69
  • 436
  • 539
  • +1. When you're dealing with files this large, it's better to leave this sort of work to tools that were built for the job. – Frank Farmer Jun 09 '10 at 23:12
  • I'm agree with the idea, but I need the correct way to launch grep. What's the correct sentence to search for a string inside a file with grep? can it returm me just the position of the match? Thanks – Marc Jun 10 '10 at 00:13
  • Thank you very much zerkms!, I gonna made benchmarks to tell you how it improve the performance. – Marc Jun 10 '10 at 12:38
  • @zerkms The problema I have now is with grep and the output, he give me all the entire line, and all I have in this hughe file is in one line, then he give me the number position with a very laaaaaaarge output that I can't manage. Know you how to just output the first position and then quit grep? (something like -q with output or -m 1 without the entire line). Thanks in advance – Marc Jun 11 '10 at 13:21
1
$big_one    = fopen('big_one.txt','r');
$options    = fopen('options.txt','r');  

while(!feof($options))
{
  $option = trim(fgets($options));
  $position = substr($big_one,$option);

  if($position)
    return $position; //exit loop
}

the size of the file is quite large though. you might want to consider storing the data in a database instead. or if you absolutely can't, then use the grep solution posted here.

Sev
  • 15,401
  • 9
  • 56
  • 75
  • maybe inserting it in blocks of 4Kb for example? that foreach is for split the string? or what? – Marc Jun 09 '10 at 22:16