I have a directory with almost 100 log files, each weighing 10~15 GB. The requirement is to read each file line by line (order doesn't matter at all), clean up each line's JSON, and send it to the backend Elasticsearch storage for indexing.
Here is my worker that does this job:
# file = worker.php
echo " -- New PHP Worker Started -- \n"; // to count how many times gnu-parallel started the worker
$dataSet = [];
while (false !== ($line = fgets(STDIN))) {
    // decode the line of JSON text into a PHP object
    $l = json_decode($line);
    $dataSet[] = $l;
    if (count($dataSet) >= 1000) {
        // bulk-index the batch to elasticsearch and start a new batch
        $elasticsearch->bulkIndex($dataSet);
        $dataSet = [];
    }
}
// flush the last partial batch so the final < 1000 lines are not lost
if (!empty($dataSet)) {
    $elasticsearch->bulkIndex($dataSet);
}
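For reference, I sanity-check the worker on its own by piping a small sample over STDIN (the file name below is just an example):
# feed the worker a few thousand lines directly, without parallel
head -n 5000 10GB_input_file.txt | php worker.php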
With the help of answers here and here, I am almost there and it is working (kind of), but I just need to make sure that under the hood it is actually doing what I assume it is doing.
With just one file, I can handle it as below:
parallel --pipepart -a 10GB_input_file.txt --round-robin php worker.php
That works great. Adding --round-robin makes sure that each php worker process is started only once and then just keeps receiving data as a pipeline (a poor man's queue).
So on a 4-CPU machine, it fires up 4 php workers and crunches through all the data very quickly.
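A quick way I verify the worker count is to count the startup banner that worker.php prints (this assumes the echo at the top of the worker ends with a newline):
# count how many php worker processes were actually started;
# each process prints the banner exactly once
parallel --pipepart -a 10GB_input_file.txt --round-robin php worker.php |
  grep -c 'New PHP Worker Started'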
To do the same for all files, here is my take on it:
find /data/directory -maxdepth 1 -type f | parallel cat | parallel --pipe -N10000 --round-robin php worker.php
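Spelled out, that one-liner runs two separate parallel invocations (same command as above, just reformatted with comments):
find /data/directory -maxdepth 1 -type f |              # list the log files
  parallel cat |                                        # stage 1: one cat job per file, all output merged into one stream
  parallel --pipe -N10000 --round-robin php worker.php  # stage 2: chop the stream into 10000-line records and round-robin them to the workers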
This kind of looks like it's working, but I have a gut feeling that nesting parallel like this is the wrong way to handle all the files.
Secondly, since it cannot use --pipepart, I think it is slower.
Thirdly, once the job is complete, I see that on a 4-CPU machine only 4 workers were started in total and the job got done. Is that the right behavior? Shouldn't it start 4 workers for every file? I just want to make sure I didn't miss any data.
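One alternative I have been considering, to avoid the nesting entirely, is to run --pipepart once per file, though I am not sure whether it is actually better:
# process files one after another, but split each individual file across all cores
find /data/directory -maxdepth 1 -type f | while read -r f; do
  parallel --pipepart -a "$f" --round-robin php worker.php
done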
Any idea how this could be done in a better way?