I have two files, urls.log (1 GB) and ids.txt (20 MB). The first file, urls.log, looks like this:
/product/80x80/436284940/
/product/100x100/1051907917/Pavillon-2.jpg
/product/140x140/988563549/LITTLE-ROSE-Mikrofasermischung-Maxi-Slips-uni-5er-Pack.jpg
/product/100x100/504170379/Dunlop-SP-Sport-Maxx-215-40R17-87V-XL-VW1-MFS.jpg
...
The second file, ids.txt, looks like this:
988563549
988563540
988563541
...
The result (result.txt) should be:
/product/140x140/988563549/LITTLE-ROSE-Mikrofasermischung-Maxi-Slips-uni-5er-Pack.jpg
Because 988563549 exists in ids.txt, we need this record from urls.log; any line whose ID is not in ids.txt should be dropped. We also don't need /product/80x80/5252352/ because it's a folder, not an image.
What I wrote in PHP is:
    $file = '/combined/combine.url.sanitized.access_log';
    $handle = fopen($file, "r");
    if ($handle) {
        while (($line = fgets($handle)) !== false)
        {
            $handleids = fopen('/script/ids.txt', "r");
            while (($lineIds = fgets($handleids)) !== false)
            {
                if (strpos($line, trim($lineIds)) !== false)
                {
                    file_put_contents('result.txt', $line . PHP_EOL, FILE_APPEND | LOCK_EX);
                    break;
                }
            }
            fclose($handleids);
            file_put_contents('result.txt', '=' . PHP_EOL, FILE_APPEND | LOCK_EX);
        }
        fclose($handle);
    }
This runs very slowly; by my estimate it would need approximately 60 days to finish. How can I improve it? It is OK to use another language, but I'm not familiar with other languages, so please explain in detail.
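For reference, here is a minimal sketch of one possible direction in PHP. It is not a tested solution for the real data. The code above re-reads ids.txt for every line of urls.log; the sketch instead reads ids.txt once into an array keyed by ID and then checks each URL with a single lookup. The function names (loadIds, extractImageId, filterUrls) are hypothetical. The sketch assumes every line has the form /product/<size>/<id>/<filename> as in the samples, and it matches the ID segment exactly rather than as a substring the way strpos does. Holding all the IDs in memory may also require raising PHP's memory_limit.

```php
<?php
// Hypothetical helper: read ids.txt once and store each ID as an array key,
// so a membership check is a hash lookup instead of a rescan of the file.
function loadIds(string $idsPath): array
{
    $handle = fopen($idsPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $idsPath");
    }
    $ids = [];
    while (($line = fgets($handle)) !== false) {
        $id = trim($line);
        if ($id !== '') {
            $ids[$id] = true;
        }
    }
    fclose($handle);
    return $ids;
}

// Assumption: the ID is the numeric segment in /product/<size>/<id>/<filename>.
// Returns null when the line does not match or has no file name (a folder).
function extractImageId(string $line): ?string
{
    if (preg_match('#^/product/[^/]+/(\d+)/[^/\s]+#', $line, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Stream urls.log once and write matching image lines to the result file.
// Returns the number of lines kept.
function filterUrls(string $urlsPath, string $idsPath, string $resultPath): int
{
    $ids = loadIds($idsPath);
    $in = fopen($urlsPath, 'r');
    $out = fopen($resultPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open input or output file');
    }
    $kept = 0;
    while (($line = fgets($in)) !== false) {
        $id = extractImageId($line);
        if ($id !== null && isset($ids[$id])) {
            // Keep the output handle open instead of reopening it per line.
            fwrite($out, rtrim($line, "\r\n") . PHP_EOL);
            $kept++;
        }
    }
    fclose($in);
    fclose($out);
    return $kept;
}
```

With the paths from the question, it could be called as filterUrls('/combined/combine.url.sanitized.access_log', '/script/ids.txt', 'result.txt'). Each file is then read only once, so the total work grows with the sum of the two file sizes rather than their product.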