2

I need to detect files which contain my string. File sizes can be bigger than 4 GB. I cannot simply use tools like file_get_contents() because it tries to load the whole file into RAM.

How can I do this? Using standard PHP? Using elasticsearch or other external search engine?

Bartłomiej Semańczyk
  • 59,234
  • 49
  • 233
  • 358
  • You may read it line by line, but it depends on what these files contain. Can you tell us more about your file content (csv, xml, random text, 1-line text...) and the kind of string you're looking for? – Random Mar 04 '16 at 16:32
  • Create a loop over the .txt file content and use `IF` statements to filter out what you want to see. – yardie Mar 04 '16 at 16:38
  • This seems like a better option: https://stackoverflow.com/questions/3686177/php-to-search-within-txt-file-and-echo-the-whole-line – Mike Q Aug 24 '17 at 17:31

3 Answers

5

If you have a Linux-based machine, you can use the grep command:

shell_exec('grep "text string to search" /path/to/file');

As output you will get every line of the file that contains your text.


If you need to find all files containing some text in a directory, you can use:

shell_exec('grep -rl "text string to search" /path/to/dir');

-r stands for "recursive", so grep will look inside every file under the directory

-l stands for "list filenames", so only the names of matching files are printed

As a result, you will get all matching filenames, one per line.
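To use this from PHP, you can split grep's output into an array of filenames. A minimal sketch of that idea (the wrapper name `grepFilesContaining` is mine, not part of the answer):

```php
<?php
// Sketch: run grep from PHP and collect matching filenames.
// escapeshellarg() protects against shell injection in pattern and path.
function grepFilesContaining($pattern, $dir) {
    $cmd = 'grep -rl ' . escapeshellarg($pattern) . ' ' . escapeshellarg($dir);
    $output = shell_exec($cmd);
    // grep prints one matching filename per line; drop empty lines
    if ($output === null) {
        return [];
    }
    return array_values(array_filter(explode("\n", trim($output)), 'strlen'));
}
```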

Full
  • 455
  • 4
  • 13
  • Thanks! It's a good solution for Linux and I would use it in my projects, but now I have a task to run this code on both Linux and Windows. The best approach I know of now is to read the file in parts using `fopen`, `fseek`, `fgets` and search within each part. – Роман Слободзян Mar 04 '16 at 18:56
  • 1
    I know it's system-dependent, so not best practice, but you can use the Windows equivalent findstr (https://technet.microsoft.com/en-us/library/bb490907.aspx) if you care about time (native tools are surely faster than reading the file line by line!) – Full Jun 03 '16 at 11:10
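The two comments above can be combined into a small OS switch. A sketch, assuming grep on Linux/macOS and findstr on Windows (findstr flags: /m prints only filenames, /s recurses into subdirectories, /c: searches a literal string); `buildSearchCommand` is a hypothetical helper, not from the answer:

```php
<?php
// Sketch: pick the native search tool for the current platform.
function buildSearchCommand($pattern, $dir) {
    if (stripos(PHP_OS, 'WIN') === 0) {
        // Windows: findstr /m /s /c:"literal string" dir\*
        return 'findstr /m /s /c:' . escapeshellarg($pattern)
            . ' ' . escapeshellarg($dir . '\\*');
    }
    // Linux/macOS: grep -rl "pattern" dir
    return 'grep -rl ' . escapeshellarg($pattern) . ' ' . escapeshellarg($dir);
}
```

Then pass the result to shell_exec() and parse the output as in the other answer.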
5

You may use something like this. It is not optimized or tested, and may contain a bug I have not noticed, but it should give you the idea:

function findInFile($file_name, $search_string, $chunk_size = 1024) {
    // We keep one previous chunk as overlap, so a match straddling a
    // chunk boundary is still found. A search string longer than the
    // chunk size could span three chunks and slip through the overlap.
    if (strlen($search_string) > $chunk_size) {
        throw new \RuntimeException('Size of search string should not exceed size of chunk');
    }
    $file = new \SplFileObject($file_name, 'r');
    $last_buffer = '';
    while (!$file->eof()) {
        $chunk = $file->fread($chunk_size);
        $buffer = $last_buffer . $chunk;
        // strpos() (not strstr()) returns the numeric offset of the match
        $position_in_buffer = strpos($buffer, $search_string);
        if ($position_in_buffer !== false) {
            // The buffer starts at ftell() minus everything appended to it,
            // so convert the offset in the buffer to an offset in the file
            return $file->ftell() - strlen($chunk) - strlen($last_buffer)
                + $position_in_buffer;
        }
        $last_buffer = $chunk;
    }
    return null;
}
vfsoraki
  • 2,186
  • 1
  • 20
  • 45
3

file_get_contents returns the contents of the whole file as a single variable. In your case that means it would try to create a 4 GB string, which exhausts the allowed memory.

Try using fopen and fgets instead. They let you process the file in smaller chunks.

Give it a try! :)

Shashank Shah
  • 2,077
  • 4
  • 22
  • 46