
I have two files, urls.log (1 GB) and ids.txt (20M). The first file, urls.log, looks like this:

/product/80x80/436284940/
/product/100x100/1051907917/Pavillon-2.jpg
/product/140x140/988563549/LITTLE-ROSE-Mikrofasermischung-Maxi-Slips-uni-5er-Pack.jpg
/product/100x100/504170379/Dunlop-SP-Sport-Maxx-215-40R17-87V-XL-VW1-MFS.jpg
...

The second file, ids.txt, looks like this:

988563549
988563540
988563541
...

The result (result.txt) should be:

/product/140x140/988563549/LITTLE-ROSE-Mikrofasermischung-Maxi-Slips-uni-5er-Pack.jpg

Because 988563549 exists in ids.txt, we need that record from urls.log; otherwise we don't need the line. We also don't need /product/80x80/5252352/ because it is a folder, not an image.

What I wrote in PHP is:

$file = '/combined/combine.url.sanitized.access_log';
$handle = fopen($file, "r");
if ($handle) {
    while (($line = fgets($handle)) !== false)
    {
        // Re-opens and scans the whole of ids.txt for every single line of the log
        $handleids = fopen('/script/ids.txt', "r");
        while (($lineIds = fgets($handleids)) !== false)
        {
            if (strpos($line, trim($lineIds)) !== false)
            {
                // Opens, appends to and closes result.txt on every match
                file_put_contents('result.txt', $line . PHP_EOL, FILE_APPEND | LOCK_EX);
                break;
            }
        }
        fclose($handleids);
        file_put_contents('result.txt', '=' . PHP_EOL, FILE_APPEND | LOCK_EX);
    }

    fclose($handle);
}

This works so slowly that, by my calculation, it would take approximately 60 days. How should I improve it? It is OK to use another language to achieve this, but I'm not familiar with other languages, so please give as much detail as possible.

Phoenix
  • `file_put_contents` will create a file pointer, write to it, and close it again for each call. Seeing as you're appending in a loop, `fopen` would be the better option. Be that as it may: PHP is not really suited to process log files of 1 Gb in size. Other languages are much better equipped for this. Go, R, Python... – Elias Van Ootegem Jan 29 '15 at 11:02
  • *This work so slowly* – that is right. You can use OS tools to get your result, e.g. sed, grep, comm, or write a Perl / bash script – donald123 Jan 29 '15 at 11:02
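
A minimal sketch of what the first comment suggests: open result.txt once outside the loop and write to that handle with fwrite, instead of calling file_put_contents (which opens, writes and closes the file) on every match. The nested-loop logic from the question is otherwise kept as-is and the '=' marker write is left out, so this only addresses the output-file overhead, not the repeated scanning of ids.txt.

<?php
// Open the output file a single time; fwrite then reuses this handle.
$out = fopen('result.txt', 'a');
$handle = fopen('/combined/combine.url.sanitized.access_log', 'r');
if ($handle && $out) {
    while (($line = fgets($handle)) !== false) {
        $handleids = fopen('/script/ids.txt', 'r');
        while (($lineIds = fgets($handleids)) !== false) {
            if (strpos($line, trim($lineIds)) !== false) {
                fwrite($out, $line); // $line already ends with its newline
                break;
            }
        }
        fclose($handleids);
    }
    fclose($handle);
    fclose($out);
}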

2 Answers


When you have a file full of patterns and another file to search for those patterns, you can use the `-f` option of `grep` (`-F` is used because your pattern file contains only fixed strings, not regex patterns):

grep -Ff ids.txt urls.log

To ignore anything that ends in a slash, you can pipe to `grep` again, this time using `-v` to exclude matching lines:

grep -Ff ids.txt urls.log | grep -v '/$' > result.txt

This should be faster than your PHP script. If it is still too slow, you may want to look into using Perl (e.g. this question) or Python.

Josh Jolly

First, you can cache ids.txt into a set. Then start a reactor thread that iterates over urls.log and pushes each line onto a queue, and start some worker threads to consume the queue; in each worker thread, use the set built from ids.txt to filter the lines.
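
A single-threaded sketch of the core idea: load ids.txt once into a hash set (PHP array keys), then stream urls.log and keep only lines whose ID segment is in the set. The regular expression and the plain urls.log / result.txt file names are assumptions based on the question's examples, and the queue / worker-thread part of the answer is omitted here, since the constant-time set lookup alone removes the inner loop over ids.txt.

<?php
// Build a set of IDs; using them as array keys gives constant-time lookups.
$ids = array();
foreach (file('ids.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $id) {
    $ids[trim($id)] = true;
}

$in  = fopen('urls.log', 'r');
$out = fopen('result.txt', 'w');
while (($line = fgets($in)) !== false) {
    // Expect paths like /product/140x140/988563549/FILE.jpg;
    // lines that end in a slash (folders) do not match.
    if (preg_match('#^/product/[^/]+/(\d+)/[^/]+$#', rtrim($line), $m)
        && isset($ids[$m[1]])) {
        fwrite($out, $line);
    }
}
fclose($in);
fclose($out);

Each log line now costs one regex match and one hash lookup instead of a pass over all of ids.txt, so a 1 GB file should be processed in minutes rather than days.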

bing ge