I have a Python 3 script which reads data into a buffer with:

fp = open("filename", 'rb')
data = fp.read(count)

I don't fully understand (even after reading the documentation) what read() returns. It appears to be some kind of binary data which is iterable. But it is not a list.

Confusingly, elsewhere in the script, lists are used for binary data.

frames = []
# then later... inside a loop
for ...
    data = b''.join(frames)

Regardless, I want to iterate over the object returned by read() in units of words (i.e. 2-byte blocks).

At the moment the script contains this for loop:

for c in data:
    # do something

Is it possible to change c such that this loop iterates over words (2 byte blocks) rather than individual bytes?

I cannot use read() in a loop to read 2 bytes at a time.

Neuron
FreelanceConsultant
  • "I don't fully understand (even after reading the documentation) what read() returns. It appears to be some kind of binary data which is iterable. But it is not a list." Was [this](https://docs.python.org/3/library/functions.html#open) the documentation you found? It says right there, `Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding`, and the word `bytes` is linked to the documentation for that type. – Karl Knechtel Mar 24 '21 at 12:07
  • "Confusingly, elsewhere in the script, lists are used for binary data." *A* list is used, which presumably contains `bytes` objects that are concatenated using the `join` method of `b''` (another `bytes` object). – Karl Knechtel Mar 24 '21 at 12:08
  • What type of object do you expect `c` to be? In its current form, `c` would be an `int`. However in "2 byte blocks" it would be a `bytes` object. – Axe319 Mar 24 '21 at 12:38
  • @KarlKnechtel That appears to be the documentation for `open()`? Not `read()`. – FreelanceConsultant Mar 24 '21 at 12:39
  • Yes, it's the documentation for `open()`. `open()` is how you create file objects, and the documentation there explains what you can do with those. `.read()` is a method of file objects, so that's where you get the appropriate documentation. – Karl Knechtel Mar 24 '21 at 12:41
  • You could also click through the link to [Reading and Writing Files](https://docs.python.org/3/tutorial/inputoutput.html#tut-files) from the official Python tutorial, which explains: `'b' appended to the mode opens the file in binary mode: now the data is read and written in the form of bytes objects. This mode should be used for all files that don’t contain text.` – Karl Knechtel Mar 24 '21 at 12:42

3 Answers

We can explicitly read (up to) n bytes from a file opened in binary mode with .read(n) (just as it would read up to n Unicode code points from a file opened in text mode). This is a blocking call; for a regular file it returns fewer than n bytes only at the end of the file.

We can use the two-argument form of iter to build an iterator that repeatedly calls a callable:

>>> help(iter)
Help on built-in function iter in module builtins:

iter(...)
    iter(iterable) -> iterator
    iter(callable, sentinel) -> iterator

    Get an iterator from an object.  In the first form, the argument must
    supply its own iterator, or be a sequence.
    In the second form, the callable is called until it returns the sentinel.

At the end of the file, read returns an empty bytes object (b'') rather than raising an exception, so we can use that as our sentinel.

Putting it together, we get:

for pair in iter(lambda: fp.read(2), b''):
    ...  # process the two-byte chunk here

Inside the loop, we will get bytes objects that represent two bytes of data. You should check the documentation to understand how to work with these.
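As a minimal sketch of working with those two-byte chunks: the snippet below uses an in-memory io.BytesIO stream as a stand-in for the opened file, and assumes the words are little-endian unsigned integers. The question does not state the byte order, so adjust byteorder as needed. The sample data and the name words are purely illustrative.

```python
import io

# Stand-in for open("filename", "rb"); the sample bytes are made up.
fp = io.BytesIO(b"\x01\x00\x02\x00\xff")

words = []
for pair in iter(lambda: fp.read(2), b""):
    if len(pair) < 2:
        # A file with an odd length yields a short final chunk; here we stop.
        break
    # Assumption: little-endian, unsigned 16-bit words.
    words.append(int.from_bytes(pair, byteorder="little"))

print(words)  # [1, 2]
```

Note that the last chunk can be a single byte when the file length is odd, so the loop body has to decide what to do with it.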

Karl Knechtel

When you read a file in binary mode, a bytes object is returned; bytes is one of the standard Python built-in types. Its literal looks like a string prefixed with b, as in b"...". When you print it, each byte may be displayed either as an escape of the form \x** (where ** are the two hex digits of the byte's value, from 0 to 255) or, if it corresponds to a printable ASCII character, as that character directly. You can read more about this, and about the methods of bytes (many of which mirror those of strings), in the bytes docs.
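A short illustration of these points, using a made-up bytes literal: indexing or iterating gives int values, while slicing gives another bytes object.

```python
data = b"AB\x00\xff"  # illustrative sample, not from the question

print(data)        # b'AB\x00\xff'
print(list(data))  # [65, 66, 0, 255]
print(data[0])     # 65
print(data[0:2])   # b'AB'
```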

There already seems to be a very popular question on Stack Overflow about how to iterate over a bytes object. The currently accepted answer gives this example for creating a list of the individual bytes in a bytes object:

L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]

I suppose that modifying it like this will work for you:

L = [bytes_obj[i:i+2] for i in range(0, len(bytes_obj), 2)]

For example:

by = b"\x00\x01\x02\x03\x04\x05\x06" 
# The object returned by file.read() is also bytes, like the one above
words = [by[i:i+2] for i in range(0, len(by), 2)]
print(words)
# Output --> [b'\x00\x01', b'\x02\x03', b'\x04\x05', b'\x06']

Alternatively, if the list is likely to be too large to store efficiently all at once, create a generator that yields the words in the same way:

def get_words(bytesobject):
    for i in range(0, len(bytesobject), 2):
        yield bytesobject[i:i+2]
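If the words are meant to be treated as numbers rather than as two-byte bytes objects, the standard library's struct.iter_unpack may be a convenient alternative. This is only a sketch under two assumptions that the question does not state: the words are little-endian unsigned 16-bit values (format "<H"), and an odd trailing byte can be dropped, since iter_unpack requires the buffer length to be a multiple of the item size. The helper name iter_u16 is hypothetical.

```python
import struct

def iter_u16(buf):
    """Yield little-endian unsigned 16-bit integers from buf (hypothetical helper)."""
    usable = len(buf) - (len(buf) % 2)  # drop an odd trailing byte, if any
    for (value,) in struct.iter_unpack("<H", buf[:usable]):
        yield value

by = b"\x00\x01\x02\x03\x04\x05\x06"
print(list(iter_u16(by)))  # [256, 770, 1284]
```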
gdcodes
  • Ok thanks, this is pretty useful. It seems like I'm actually asking the wrong question. Perhaps I should have asked, "why does iterating over bytes object return `int`s" – FreelanceConsultant Mar 24 '21 at 16:16
  • "Perhaps I should have asked, "why does iterating over bytes object return ints" Because a `bytes` object represents a sequence of bytes *of memory*, and the simplest way to interpret an 8-bit value is as an 8-bit unsigned integer? Seems about right to me. Certainly I have found it useful in the past. I'm not sure what else you expected; it's not a string type (in 3.x; and the 2.x design was frankly quite flawed). – Karl Knechtel Mar 24 '21 at 17:57

In the simplest, most literal sense, something like this gives you a loop that reads two bytes at a time.

with open("/etc/passwd", "rb") as f:
    w = f.read(2)
    while w:  # read() returns b'' at end of file
        print(w)
        w = f.read(2)

As for what you are getting from read, it's a bytes object, because you specified 'b' in the mode passed to `open()`.

I think a more Pythonic way to express it would be via an iterator or generator.
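One way that might look, as a rough sketch: a generator that wraps the same read(2) loop. The function name read_words is hypothetical, and a temporary file with made-up contents is used so that the example does not depend on any particular path.

```python
import os
import tempfile

def read_words(f, size=2):
    """Yield successive chunks of `size` bytes from a binary file object."""
    while True:
        w = f.read(size)
        if not w:  # b'' signals end of file
            return
        yield w

# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcde")
    path = tmp.name

with open(path, "rb") as f:
    chunks = list(read_words(f))

os.remove(path)
print(chunks)  # [b'ab', b'cd', b'e']
```

The last chunk is shorter than two bytes here because the sample file has an odd length.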

cms