
Let's suppose I have a block of bytes like this:

block = b'0123456789AB'

I want to take the first 3 bytes of each chunk of 4 bytes and join them together. The result for the block above should be:

b'01245689A'  # 3, 7 and B are dropped

I could solve this with a script like this:

block = b'0123456789AB'
result = b''
for i in range(0, len(block), 4):
    result += block[i:i + 3]
print(result)

But as is well known, Python is quite inefficient with such for-loops and bytes concatenation, so my approach takes practically forever when applied to a really huge block of bytes. Is there a faster way to do this?

Fomalhaut

1 Answer


Make it mutable and delete the unwanted slice?

>>> tmp = bytearray(block)
>>> del tmp[3::4]
>>> bytes(tmp)
b'01245689A'

If your chunks are large and you want to remove almost all of their bytes, it might become faster to instead collect what you do want, similar to your approach. Your version potentially takes quadratic time, though, because each += copies the whole result built so far, so better to use join:

>>> b''.join([block[i : i+3] for i in range(0, len(block), 4)])
b'01245689A'

(Btw according to PEP 8 it should be block[i : i+3], not block[i:i + 3], and for good reason.)

That builds a lot of temporary objects, though, which could be a memory problem. And for your stated case it's much faster than yours, but much slower than my bytearray one.

Benchmark with block = b'0123456789AB' * 100_000 (much smaller than the 1GB you mentioned in the comments below):

    0.00 ms      0.00 ms      0.00 ms  baseline
15267.60 ms  14724.33 ms  14712.70 ms  original
    2.46 ms      2.46 ms      3.45 ms  Kelly_Bundy_bytearray
   83.66 ms     85.27 ms    122.88 ms  Kelly_Bundy_join

Benchmark code:

import timeit

def baseline(block):
    # measures pure call overhead, as a reference point
    pass

def original(block):
    result = b''
    for i in range(0, len(block), 4):
        result += block[i:i + 3]
    return result

def Kelly_Bundy_bytearray(block):
    tmp = bytearray(block)
    del tmp[3::4]
    return bytes(tmp)

def Kelly_Bundy_join(block):
    return b''.join([block[i : i+3] for i in range(0, len(block), 4)])

funcs = [
    baseline,
    original,
    Kelly_Bundy_bytearray,
    Kelly_Bundy_join,
    ]

block = b'0123456789AB' * 100_000
args = block,
number = 10**0

# correctness check: every function must produce the same result as the original
expect = original(*args)
for func in funcs:
    print(func(*args) == expect, func.__name__)
print()

# three timing rounds; each round appends the best-of-repeat time and reprints all times so far
tss = [[] for _ in funcs]
for _ in range(3):
    for func, ts in zip(funcs, tss):
        t = min(timeit.repeat(lambda: func(*args), number=number)) / number
        ts.append(t)
        print(*('%8.2f ms ' % (1e3 * t) for t in ts), func.__name__)
    print()
Kelly Bundy
  • Wow! It is impressive. This works. I tried your code on a block of size 1Gb and it was processed in 1 second on my machine. Thank you! – Fomalhaut Feb 04 '21 at 14:35
  • If I had more than 1 byte to drop in each chunk, should I apply `del` several times? – Fomalhaut Feb 04 '21 at 14:39
  • @Fomalhaut Yes, `del` multiple times (a sketch is shown after these comments). Unless you have large chunks and want to remove almost all of their bytes, then something else might be faster. – Kelly Bundy Feb 04 '21 at 14:44
  • @Fomalhaut Added some more stuff to the answer. – Kelly Bundy Feb 04 '21 at 15:21
  • Thank you for the research! In the line `b''.join([block[i : i+3] for i in range(0, len(block), 4)])` I would recommend using a generator instead of a list because of memory consumption and performance: `b''.join(block[i : i+3] for i in range(0, len(block), 4))`. – Fomalhaut Feb 05 '21 at 07:32
  • @Fomalhaut [No](https://stackoverflow.com/a/9061024/12671057). Unless you can show that that's obsolete or doesn't apply to `bytes`. I intentionally chose list comprehension for performance. Did you actually benchmark that, which my benchmark code makes trivial? I did now, and your suggestion was consistently about factor 1.14 *slower*. – Kelly Bundy Feb 05 '21 at 13:51
  • Thanks. I did run your code and the version with the generator was faster on my machine (I ran it once only)... But when I replaced 100_000 with 1_000_000, the solution with the list was better. Maybe it is a kind of measurement inaccuracy, because 100_000 is not high enough. Anyway, thank you for the discussion about it. It is an interesting subject. – Fomalhaut Feb 05 '21 at 14:13
  • Hmm, ok, sorry for being a bit snippy there. Though if you do run benchmarks, better say so and show the results :-). I think the list comp should be faster in that benchmark almost all the time, at least if they get the same computational power. I just ran the benchmark (with only those two solutions) in a loop ten times, and the list comp was faster 28 out of 30 times, mostly a lot faster, and the two times it lost it wasn't by much. Was there maybe other significant stuff running on the PC during the benchmark, or did you do it on a laptop with power-saving mode so that the CPU speed changed a lot? – Kelly Bundy Feb 05 '21 at 14:58
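
A minimal sketch of the "`del` multiple times" idea from the comments, assuming the hypothetical case that only the first 2 bytes of each 4-byte chunk should be kept. Note that after the first `del` the chunks have shrunk, so the second `del` uses the new, smaller stride:

>>> block = b'0123456789AB'
>>> tmp = bytearray(block)
>>> del tmp[3::4]  # drop byte 3 of every 4-byte chunk; chunks are now 3 bytes long
>>> del tmp[2::3]  # drop byte 2 of every (now 3-byte) chunk
>>> bytes(tmp)
b'014589'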