Iterate over individual bytes in Python 3

Question

When iterating over a bytes object in Python 3, one gets the individual bytes as ints:

>>> [b for b in b'123']
[49, 50, 51]

How can I get 1-length bytes objects instead?

The following is possible, but it is not very obvious to the reader and most likely performs badly:

>>> [bytes([b]) for b in b'123']
[b'1', b'2', b'3']

Does anybody know why Python3 returns integers? I personally prefer the behaviour of Python2. — guettli, Aug 14 '19 at 10:33
Because that’s what a byte string is: A series of numbers from 0-255 that can be used to represent any kind of data. — flying sheep, Aug 14 '19 at 11:19
I wonder whether an array object would suit your purposes better and avoid unnecessary conversions. — Mayur Patel, Jan 10 '13 at 21:42
behaves the same, or what do you mean? `>>>[b for b in bytearray(b"123")]` ⇒ `[49, 50, 51]` — flying sheep, Jan 10 '13 at 22:00
I do not believe there is a distinct "character" type in python. If you look in the docs for the array module, you'll see that "characters" in python are 1-byte integers. So the results you are seeing are consistent. However, I am recommending an array (without a full understanding of your application) to suggest that it will avoid unnecessary type conversions and object constructions that might occur if you use lists. I suspect even strings will result in extra work, but I'm not sure. As others have noted, you can then use indexing to extract the item you need. — Mayur Patel, Jan 11 '13 at 17:59

jfs · Accepted Answer · 2016-12-07T04:38:35.070

47

If you are concerned about the performance of this code and an int as a byte is not a suitable interface in your case, then you should probably reconsider the data structures that you use, e.g., use str objects instead.

You could slice the bytes object to get 1-length bytes objects:

L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
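The slicing approach relies on the fact that indexing a bytes object returns an int, while slicing returns a bytes object. A minimal check (the variable names are only for illustration):

```python
data = b'123'

# Indexing returns the integer value of the byte ...
first_as_int = data[0]
# ... while a one-element slice returns a bytes object of length 1.
first_as_bytes = data[0:1]

# Slicing every position yields the 1-length bytes objects.
pieces = [data[i:i+1] for i in range(len(data))]

# Joining the pieces reconstructs the original object.
assert b''.join(pieces) == data
print(first_as_int, first_as_bytes, pieces)
```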

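PEP 467 (Minor API improvements for binary sequences) proposes a bytes.iterbytes() method. It is a proposal only: per the comments below, the method is not available in released Python versions (checked up to 3.11). If it were adopted, it would work like this: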

>>> list(b'123'.iterbytes())
[b'1', b'2', b'3']
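Because iterbytes() is only proposed, code that wants this behaviour today needs a stand-in. One possibility, sketched here as an assumption and not as part of the PEP, is memoryview.cast('c') (Python 3.3+), which views each byte as a struct-style char so that iteration yields 1-length bytes objects. The helper name iterbytes below is hypothetical.

```python
def iterbytes(data):
    """Hypothetical stand-in: iterate over `data` as 1-length bytes objects."""
    # cast('c') reinterprets the buffer as chars, so items are bytes, not ints.
    return iter(memoryview(data).cast('c'))

print(list(iterbytes(b'123')))
```

Unlike slicing, this does not preserve the input type: a bytearray input also yields bytes objects.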

edited Dec 07 '16 at 04:38

answered Jan 10 '13 at 21:53


@flyingsheep: there are other solutions depending on what exactly you need, e.g., `L = list(memoryview(bytes_obj))` – jfs Jan 10 '13 at 22:02
that creates a list of integers again – flying sheep Jan 10 '13 at 22:10
@flyingsheep: I certainly get that this doesn't look very Pythonic. However, the design of Python 3 forces a certain awkwardness in handling bytes, so this indeed might be the most idiomatic form, if you really must stick to bytes. – John Y Jan 10 '13 at 22:13
@flyingsheep: yes, `list(memoryview(bytes_obj))` returns a list of ints on Python 3.3+ (I've tried it on Python 3.2.3 where it returns a list of bytes objects). – jfs Jan 10 '13 at 22:31
oh, wow, that’s quite a big API change! might be justified as bugfix, though. – flying sheep Jan 10 '13 at 23:33
@PavelŠimerda: there is pep 467 that may improve this particular use-case i.e., John Y is not alone in thinking that Python 3 API for bytes can be improved. – jfs Sep 22 '14 at 09:36
`iterbytes` doesn't appear to work as of python 3.8 – Lord Elrond Jul 24 '20 at 01:11
“This [`iterbytes`] PEP has been deferred until Python 3.9 at the earliest, as the open questions aren't currently expected to be resolved in time for the Python 3.8 feature addition deadline in May 2019 (if you're keen to see these changes implemented and are willing to drive that resolution process, contact the PEP authors).” https://www.python.org/dev/peps/pep-0467/ – jfs Jul 24 '20 at 19:44
This is still not available in Python 3.11, the newest version of Python. As per the PEP we'll get this functionality in Python 3.12. – Joooeey Nov 17 '22 at 10:12
@Joooeey The [python version bump seems automatic](https://github.com/python/peps/commit/f613ad88018e8edda94977074fc8d633cfd6225d) and the pep is in Draft status i.e., I wouldn't hold my breath that it happens in 3.12. – jfs Nov 17 '22 at 17:21

snakecharmerb · Answer 2 · 2019-08-18T12:44:18.230

int.to_bytes

int objects have a to_bytes method which can be used to convert an int to its corresponding byte:

>>> import sys
>>> [i.to_bytes(1, sys.byteorder) for i in b'123']
[b'1', b'2', b'3']

As with some of the other answers, it's not clear that this is more readable than the OP's original solution: I think the length and byteorder arguments make it noisier.
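On the byteorder argument specifically: with a length of 1 there is only one byte to order, so the choice cannot change the result. A small sketch that checks this (the variable names are illustrative):

```python
import sys

data = b'\x00\x7f\x80\xff'

# With length 1, every byteorder produces the same single byte.
big = [i.to_bytes(1, 'big') for i in data]
little = [i.to_bytes(1, 'little') for i in data]
native = [i.to_bytes(1, sys.byteorder) for i in data]

assert big == little == native
print(big)
```

From Python 3.11 onward both arguments have defaults (length=1, byteorder='big'), so i.to_bytes() alone should suffice there; the explicit form above also runs on older versions.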

struct.unpack

Another approach would be to use struct.unpack, though this might also be considered difficult to read, unless you are familiar with the struct module:

>>> import struct
>>> struct.unpack('3c', b'123')
(b'1', b'2', b'3')

(As jfs observes in the comments, the format string for struct.unpack can be constructed dynamically; in this case we know the number of individual bytes in the result must equal the number of bytes in the original bytestring, so struct.unpack(str(len(bytestring)) + 'c', bytestring) is possible.)
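The dynamic format string described above could be wrapped up as follows. This is a minimal sketch; the helper name split_bytes is hypothetical and not from the answer.

```python
import struct

def split_bytes(data):
    """Hypothetical helper: split `data` into a list of 1-length bytes objects."""
    # '<n>c' means n consecutive chars; each 'c' unpacks to a bytes object of length 1.
    return list(struct.unpack('%dc' % len(data), data))

print(split_bytes(b'123'))
```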

Performance

>>> import random, timeit
>>> bs = bytes(random.randint(0, 255) for i in range(100))

>>> # OP's solution
>>> timeit.timeit(setup="from __main__ import bs",
                  stmt="[bytes([b]) for b in bs]")
46.49886950897053

>>> # Accepted answer from jfs
>>> timeit.timeit(setup="from __main__ import bs",
                  stmt="[bs[i:i+1] for i in range(len(bs))]")
20.91463226894848

>>> # Leon's answer
>>> timeit.timeit(setup="from __main__ import bs", 
                  stmt="list(map(bytes, zip(bs)))")
27.476876026019454

>>> # guettli's answer
>>> timeit.timeit(setup="from __main__ import iter_bytes, bs",        
                  stmt="list(iter_bytes(bs))")
24.107485140906647

>>> # user38's answer (with Leon's suggested fix)
>>> timeit.timeit(setup="from __main__ import bs", 
                  stmt="[chr(i).encode('latin-1') for i in bs]")
45.937552741961554

>>> # Using int.to_bytes
>>> timeit.timeit(setup="from __main__ import bs;from sys import byteorder", 
                  stmt="[x.to_bytes(1, byteorder) for x in bs]")
32.197659170022234

>>> # Using struct.unpack, converting the resulting tuple to list
>>> # to be fair to other methods
>>> timeit.timeit(setup="from __main__ import bs;from struct import unpack", 
                  stmt="list(unpack('100c', bs))")
1.902243083808571

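struct.unpack seems to be at least an order of magnitude faster than the other methods, presumably because it operates at the byte level. int.to_bytes, on the other hand, performs worse than most of the "obvious" approaches. (Each figure is the total time in seconds for timeit's default of 1,000,000 executions on the 100-byte input.)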
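Timings like these vary by machine and Python version, and they only matter if the candidates return the same thing. A small sketch that checks the approaches timed above against each other on every possible byte value, including 128-255 (the dictionary labels are illustrative):

```python
import struct
import sys

data = bytes(range(256))  # every possible byte value

candidates = {
    "bytes([b])": [bytes([b]) for b in data],
    "slicing": [data[i:i+1] for i in range(len(data))],
    "map/zip": list(map(bytes, zip(data))),
    "latin-1": [chr(i).encode('latin-1') for i in data],
    "to_bytes": [i.to_bytes(1, sys.byteorder) for i in data],
    "struct": list(struct.unpack('%dc' % len(data), data)),
}

# Every candidate must match the slicing result and contain only 1-length items.
reference = candidates["slicing"]
for name, result in candidates.items():
    assert result == reference, name
    assert all(len(item) == 1 for item in result), name
print("all candidates agree on", len(data), "byte values")
```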

@Leon FWIW I think your answer is the most pythonic; I guess the destination of the bounty will depend on whether the bounty-giver wants readability or performance :) (or the appearance of more, better answers). — snakecharmerb, Aug 19 '19 at 06:57

score 12 · Answer 3 · answered Aug 20 '19 at 14:20

I thought it might be useful to compare the runtimes of the different approaches, so I made a benchmark using my library simple_benchmark (code below).

Probably unsurprisingly, the NumPy solution is by far the fastest for large bytes objects.

If a list is required as the result, both the NumPy solution (with tolist()) and the struct solution are still much faster than the alternatives.

I didn't include guettli's answer because it's almost identical to jfs's solution; it just uses a generator function instead of a comprehension.

import numpy as np
import struct
import sys

from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

@b.add_function()
def jfs(bytes_obj):
    return [bytes_obj[i:i+1] for i in range(len(bytes_obj))]

@b.add_function()
def snakecharmerb_tobytes(bytes_obj):
    return [i.to_bytes(1, sys.byteorder) for i in bytes_obj]

@b.add_function()
def snakecharmerb_struct(bytes_obj):
    return struct.unpack(str(len(bytes_obj)) + 'c', bytes_obj)

@b.add_function()
def Leon(bytes_obj):
    return list(map(bytes, zip(bytes_obj)))

@b.add_function()
def rusu_ro1_format(bytes_obj):
    return [b'%c' % i for i in bytes_obj]

@b.add_function()
def rusu_ro1_numpy(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1')

@b.add_function()
def rusu_ro1_numpy_tolist(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1').tolist()

@b.add_function()
def User38(bytes_obj):
    return [chr(i).encode() for i in bytes_obj]

@b.add_arguments('byte object length')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, b'a' * size

r = b.run()
r.plot()
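If simple_benchmark is not installed, a rough comparison can be made with the standard library alone. The following is a sketch under that assumption, not a replacement for the benchmark above; the function names and the compare helper are hypothetical, and it covers only three of the approaches.

```python
import struct
import timeit

def by_slicing(data):
    return [data[i:i+1] for i in range(len(data))]

def by_struct(data):
    return list(struct.unpack('%dc' % len(data), data))

def by_constructor(data):
    return [bytes([b]) for b in data]

def compare(funcs, sizes, repeat=200):
    """Return {(function name, size): seconds taken by `repeat` calls}."""
    timings = {}
    for size in sizes:
        payload = b'a' * size
        for func in funcs:
            timings[func.__name__, size] = timeit.timeit(
                lambda: func(payload), number=repeat)
    return timings

results = compare([by_slicing, by_struct, by_constructor],
                  sizes=[2**4, 2**8, 2**12])
# Print the results grouped by size, fastest first within each size.
for (name, size), seconds in sorted(results.items(),
                                    key=lambda kv: (kv[0][1], kv[1])):
    print(f"{size:>5} bytes  {name:<15} {seconds:.6f}s")
```

The absolute numbers depend on the machine and interpreter; only the relative ordering within one run is meaningful.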

Nice chart. In my current context the performance does not matter at all. It should work and the code should look readable and easy to understand. — guettli, Aug 20 '19 at 14:49
note: `rusu_ro1_numpy` does not actually "iterate over individual bytes" (the benchmark shows it doesn't even copy the bytes -- the time is constant -- why do we need a numpy array here? a `bytes_obj` is already an iterable (over `int`s)). If an iterable (over `bytes`) is acceptable as a solution then your benchmark shows that `snakecharmerb_struct` is the fastest (though it copies the bytes, it doesn't "iterate over"). The benchmark says that `bytes_obj[i:i+1]` variant is the fastest among solutions that do iterate over individual bytes. — jfs, Sep 07 '19 at 07:58
@jfs Yeah, that's correct. The NumPy and struct solution only represent the iterable as bytes, they don't iterate over them. However these solutions gathered several upvotes so it would be unfair to exclude them, but maybe I should've discussed the differences in more details. Maybe I find the time to revise the answer in the next days. Thank you. — MSeifert, Sep 10 '19 at 19:26

kederrac · Answer 4 · 2019-08-19T20:57:10.047

Since Python 3.5 you can use % formatting with bytes and bytearray:

[b'%c' % i for i in b'123']

output:

[b'1', b'2', b'3']
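The %c conversion accepts any int in range(256), so byte values above 127 are handled as well. A quick check (the variable names are illustrative):

```python
data = b'\x00\x7f\x80\xff'

# b'%c' % i turns an int in range(256) into the corresponding single byte.
pieces = [b'%c' % i for i in data]

assert pieces == [bytes([i]) for i in data]
assert b''.join(pieces) == data
print(pieces)
```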

The above solution is 2-3 times faster than your initial approach. If you want an even faster solution, I suggest using numpy.frombuffer:

import numpy as np
np.frombuffer(b'123', dtype='S1')

output:

array([b'1', b'2', b'3'], 
      dtype='|S1')

The second solution is ~10% faster than struct.unpack (I used the same performance test as @snakecharmerb, against 100 random bytes).
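One caveat is worth checking before relying on the NumPy variant for arbitrary binary data: NumPy's fixed-width 'S' dtype treats trailing NUL bytes as padding, so a zero byte comes back as an empty bytes object. A sketch that shows this (the import is guarded only because NumPy is a third-party package and may not be installed):

```python
try:
    import numpy as np
except ImportError:  # NumPy is optional here
    np = None

data = b'a\x00b'

if np is not None:
    arr = np.frombuffer(data, dtype='S1')  # a view of the buffer, no copy
    pieces = arr.tolist()
    # The NUL byte is returned as b'' (length 0), not b'\x00'.
    print(pieces)
    print([len(p) for p in pieces])
```

For data that may contain zero bytes, one of the pure-Python approaches is the safer choice.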

score 7 · Answer 5 · answered Aug 14 '19 at 20:34

A trio of map(), bytes() and zip() does the trick:

>>> list(map(bytes, zip(b'123')))
[b'1', b'2', b'3']

However, I don't think it is any more readable than [bytes([b]) for b in b'123'], or that it performs better.
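To see why the trio works, it helps to look at the intermediate values. A short sketch (the variable names are illustrative):

```python
data = b'123'

# zip() over a single iterable wraps each item in a 1-tuple of ints.
tuples = list(zip(data))
assert tuples == [(49,), (50,), (51,)]

# bytes() accepts an iterable of ints, so each 1-tuple becomes a 1-length bytes object.
pieces = list(map(bytes, tuples))
assert pieces == [b'1', b'2', b'3']

# Passing the int directly would mean something else: bytes(49) is 49 zero bytes.
assert bytes(49) == b'\x00' * 49
```

This is also why map(bytes, data) without zip() does not give the desired result.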

Leon

score 6 · Answer 6 · answered Aug 14 '19 at 10:46

I use this helper method:

def iter_bytes(my_bytes):
    for i in range(len(my_bytes)):
        yield my_bytes[i:i+1]

It works in both Python 2 and Python 3.
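A usage sketch for the helper, run under Python 3. The helper is repeated so the snippet is self-contained; the inputs are only examples.

```python
def iter_bytes(my_bytes):
    # Same helper as above: slice out one byte at a time.
    for i in range(len(my_bytes)):
        yield my_bytes[i:i+1]

# The generator is lazy, so it can feed a loop without building a list first.
for piece in iter_bytes(b'\x00\xff'):
    print(piece, len(piece))

# Slicing preserves the input type, so a bytearray yields 1-length bytearrays.
print(list(iter_bytes(bytearray(b'ab'))))
```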

guettli

user38 · Answer 7 · 2021-05-15T17:07:18.303

1

A short way to do this:

[bytes([i]) for i in b'123\xaa\xbb\xcc\xff']

edited May 15 '21 at 17:07

answered Aug 18 '19 at 00:22

It doesn't work if the input `bytes` object contains values from the 128-255 range. You have to use the `latin-1` (same as `iso-8859-1`) encoding to fix that: `[chr(i).encode('latin-1') for i in b'\x80\xb2\xff']` – Leon Aug 18 '19 at 06:14
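The encoding issue raised in this comment can be reproduced directly. A short sketch comparing the default UTF-8 encoding with latin-1 for a value above 127:

```python
value = 0x80

# str.encode() defaults to UTF-8, which needs two bytes for U+0080.
assert chr(value).encode() == b'\xc2\x80'

# Latin-1 maps code points 0-255 one-to-one onto byte values 0-255.
assert chr(value).encode('latin-1') == b'\x80'

# With latin-1 the chr/encode round trip matches bytes([i]) for every byte value.
data = bytes(range(256))
assert [chr(i).encode('latin-1') for i in data] == [bytes([i]) for i in data]
```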
