
I have a binary file composed of many (10k+) records of 6 bytes each:

As an example record I have a byte string like this: ÿvDV

ASCII : 152, 118, 68, 86, 27, 15
Binary: 100110000111011001000100010101100001101100001111

From the string I have to extract certain bits at specific positions and then cast their values to integers:

1001100001110110 0100010001010110 0001101100001111
--     |-------| --     |-------| |------||------|
|          |      |          |        |       |
|          |      |          |        |       +------> 00001111    => $bin  = 15
|          |      |          |        +--------------> 00011011    => $site = 27
|          |      |          +-----------------------> 001010110   => $x    = 86
|          |      +----------------------------------> 01          => $dp   = 1
|          +-----------------------------------------> 001110110   => $y    = 118
+----------------------------------------------------> 10          => $tr   = 2
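
For reference, a quick sanity check (just a sketch built from the sample byte values above) showing that the bit masks used in the code below reproduce the field values in the diagram:

// Rebuild the sample record from the byte values listed above
$sample = pack("C6", 152, 118, 68, 86, 27, 15);
$words  = unpack("n3", $sample);          // three big-endian 16-bit words

echo (($words[1] & 0xC000) >> 14) . "\n"; // 2   => $tr
echo ( $words[1] & 0x01FF)        . "\n"; // 118 => $y
echo (($words[2] & 0xC000) >> 14) . "\n"; // 1   => $dp
echo ( $words[2] & 0x01FF)        . "\n"; // 86  => $x
echo (($words[3] & 0x3F00) >> 8)  . "\n"; // 27  => $site
echo ( $words[3] & 0x003F)        . "\n"; // 15  => $bin (before the +1 in the code)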

Is there a faster approach than this one?

$binary = file_get_contents("/path/to/binary/file.dat");
$startbyte = 0; // byte offset of the record currently being read
while ($startbyte + 6 <= strlen($binary)) {
    $record = unpack("n3", substr($binary, $startbyte, 6));
    $info = array(
        'tr'   => ($record[1] & 0xC000) >> 14,
        'y'    =>  $record[1] & 0x01FF,
        'dp'   => ($record[2] & 0xC000) >> 14,
        'x'    =>  $record[2] & 0x01FF,
        'site' => ($record[3] & 0x3F00) >> 8,
        'bin'  => ($record[3] & 0x003F) + 1,
    );
    $startbyte += 6;
}

In your experience, is there a faster approach?

Stefano Radaelli
  • I'm voting to close this question as off-topic because this question better suits [codereview](https://codereview.stackexchange.com/) I think? – RiggsFolly Nov 15 '18 at 16:13
  • Don't know if it's necessarily faster, but: https://stackoverflow.com/q/11653931/1255289 – miken32 Nov 15 '18 at 16:15
  • @RiggsFolly Maybe you're right but maybe more eyeballs is better for this situation. The `optimization` tag does exist here and OP clearly showed a good understanding of the code at hand. The only thing I would ask for is maybe some sample data (20+ records?) and the current benchmark. – MonkeyZeus Nov 15 '18 at 16:17
  • @MonkeyZeus, to parse 73530 records it took 1.59 seconds. The target is less than one. Anyway, mine is only a question about whether a better "logical" approach exists, maybe using some feature of the "unpack" command that isn't documented. RiggsFolly, you're probably right. Feel free to close it anytime if you think this isn't the right place. – Stefano Radaelli Nov 15 '18 at 16:41
  • I don't see any room for improvement in this code at all. You should profile what is happening when the code is run. E.g.: are you capping out on CPU, IO, or memory? Also *why* does this need to run in less than one second? You may benefit from re-thinking your approach and where/how this bit of code fits into the larger application. – Sammitch Nov 15 '18 at 17:40
  • I agree with @Sammitch in regards to profiling the code; additionally, I trust their judgement in regard to having no room for improvement because I am not familiar with `unpack()`. If anything I would choose to slightly slow down the code by reading the file one record at a time instead of using `file_get_contents()` to avoid fatal memory exhaustion errors. – MonkeyZeus Nov 15 '18 at 17:52
  • @MonkeyZeus hah I missed that one. I actually feel like changing it to `$fh = fopen(...); while( $record = fread($fh, 6) ) { ... }` would also be a slight performance improvement given PHP's solid IO internals (see the sketch after this thread). Though definitely not enough to shave 40% off the run time. – Sammitch Nov 15 '18 at 18:55
  • @Sammitch Hmm, very interesting because benchmarking [`fgets()`](https://stackoverflow.com/a/13246630/2191572) vs. `file_get_contents()` against a 600k and 70MB Apache log shows that line-by-line actually is faster than reading the whole file all at once in most test runs. I figured the PHP overhead of calling `fgets()` thousands/millions of times would be a bottleneck but I guess we learn something new every day :-) – MonkeyZeus Nov 15 '18 at 19:22
  • @MonkeyZeus PHP's IO functions aren't direct passthroughs to the underlying C/system calls, they hook into Streams which more efficiently manage IO. So you're not reading 6 bytes off the disk at a time, you're reading out of PHP's stream buffer which is intelligently managed/filled behind the scenes. I discovered this the hard way after trying to implement my own IO buffering which horribly tanked my performance. :P – Sammitch Nov 15 '18 at 19:27
  • @Sammitch Learn something new every day – MonkeyZeus Nov 15 '18 at 19:34
  • @Sammitch Actually, OP *could* see substantial savings with `fread()` because they could avoid thousands of `strlen()` and `substr()` calls. – MonkeyZeus Nov 15 '18 at 19:38
  • @StefanoRadaelli Are you still around? You may wish to try `fread()` per this comment thread. Unfortunately I cannot test the performance for you because I do not have a data file to test with. Please let me know if `fread()` helped. Good luck! – MonkeyZeus Nov 15 '18 at 20:18
  • @MonkeyZeus he could save thousands of `strlen()` calls simply by not doing them in the loop. I'm going to have to retract my original "I don't see any room for improvement in this code at all" statement. :P – Sammitch Nov 15 '18 at 20:29
  • @Sammitch For sure! One additional unknown is OP's PHP version. If they are using 5.x then going to 7.x could prove to have substantial performance gains. – MonkeyZeus Nov 15 '18 at 20:37
  • @Sammitch, I've tried replacing file_get_contents() with fread() and it improves performance by ~5%. In any case, and in line with the suggestions, my code seems to be very close to the fastest possible approach. The requirement comes from the fact that I have to read 50 different binary files every 60/120 seconds (the algorithm runs in a production environment producing overall ~2.2 TB of data every day). At the moment the workaround would be to parallelize the data parsing using all the available server threads, or to evaluate the performance gain with PHP 7. – Stefano Radaelli Nov 16 '18 at 08:49
  • Could you post your code with `fread()`? Maybe we can spot other things to try. – MonkeyZeus Nov 16 '18 at 13:50
  • I'd suggest a queue/worker arrangement with RabbitMQ. – Sammitch Nov 16 '18 at 16:19
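
A minimal sketch of the fread()-based loop suggested in the comments above (same unpack format and bit masks as in the question; untested, since no sample data file is available):

$fh = fopen("/path/to/binary/file.dat", "rb");
while (($chunk = fread($fh, 6)) !== false && strlen($chunk) === 6) {
    $record = unpack("n3", $chunk); // three big-endian 16-bit words per record
    $info = array(
        'tr'   => ($record[1] & 0xC000) >> 14,
        'y'    =>  $record[1] & 0x01FF,
        'dp'   => ($record[2] & 0xC000) >> 14,
        'x'    =>  $record[2] & 0x01FF,
        'site' => ($record[3] & 0x3F00) >> 8,
        'bin'  => ($record[3] & 0x003F) + 1,
    );
    // ... use $info here ...
}
fclose($fh);

This reads 6 bytes at a time from PHP's stream buffer and drops the per-record strlen()/substr() calls discussed above; whether it gets under the one-second target would still need profiling, as suggested in the comments.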

0 Answers