Short version first; the long version follows.

Short:

I have a 2D matrix of float32 values. I want to write it to a .txt file as raw bytes. I also want to keep the structure, which means adding a newline character at the end of each row. Some numbers, like 683.61, contain the byte \n once converted to bytes, which produces an unwanted newline and breaks reading the file line by line. How can I do this?

Long:

I am writing a program to work with huge arrays of data (2D matrices). For that purpose, I need the array stored on disk rather than in RAM, as the data might be too big for the computer's memory. I created my own file type, which the program will read. It has a header holding the important parameters as bytes, followed by the matrix as raw bytes.

As I write the data to the file one float32 at a time, I add a newline (\n) character at the end of each row of the matrix, so that I keep the structure.

Writing goes well, but reading causes issues as some numbers, once converted to byte array, include \n.

For example:

struct.pack('f',683.61)

will yield

b'\n\xe7*D'

This cuts my matrix rows short, and sometimes the cut lands in the middle of a 4-byte value, so the resulting chunks have the wrong size.
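To make the failure concrete, here is a minimal sketch of what happens to one row. The explicit little-endian format '<f' and the three sample values are assumptions for illustration; on a little-endian machine '<f' produces the same bytes as the 'f' used above.

```python
import struct

# One row of three float32 values, little-endian (assumed for reproducibility).
row = struct.pack('<3f', 1.0, 683.61, 2.0)

# The packed 683.61 starts with the byte 0x0A, which is the newline character.
packed = struct.pack('<f', 683.61)
has_newline = b'\n' in packed

# Append the row terminator, then split on newlines as a line reader would.
# The 12-byte row is cut into a 4-byte piece and a 7-byte piece.
pieces = (row + b'\n').split(b'\n')
print(has_newline, [len(p) for p in pieces])
```

The second piece is 7 bytes long, which is not a multiple of 4, so it cannot be interpreted as float32 values at all.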

From this question: "Python handling newline and tab characters when writing to file"

I found out that a str can be encoded with 'unicode_escape' to double the backslash and avoid confusion when reading.

Some_string.encode('unicode_escape')

However, this method only works on strings, not on bytes or bytearrays (I tried it). This means I can't use it when I convert a float32 directly to bytes and write it to a file.

I have also tried converting the float to bytes, decoding the bytes as a str, and re-encoding it, like so:

struct.pack('f',683.61).decode('utf-8').encode('unicode_escape')

but decode() fails here, because these bytes are not valid UTF-8.

I have also tried converting the bytes to a string directly and then encoding it, like so:

str(struct.pack('f',683.61)).encode('unicode_escape')

This yields a mess from which it is possible to get the right bytes with this :

bytes("b'\\n\\xe7*D'"[2:-1],'utf-8')

And finally, when I actually read the bytes back, I get two different results depending on whether unicode_escape has been used or not:

numpy.frombuffer(b'\n\xe7*D', dtype=numpy.float32)
    yields: array([683.61], dtype=float32)

numpy.frombuffer(b'\\n\\xe7*D', dtype=numpy.float32)
    yields: array([1.7883495e+34, 6.8086554e+02], dtype=float32)

I am expecting the top result, not the bottom one. So I am back to square one.

--> How can I encode my matrix of floats as bytes, on multiple lines, without being affected by newline bytes that occur inside the data?

F.Y.I. I decode the bytearray with numpy as this is the working method I found, but it might not be the best way. Just starting to play around with bytes.

Thank you for your help. If there is any issue with my question, please let me know and I will gladly rewrite it.

Gonzalez87

2 Answers

You either write your data as binary data, or you use newlines to keep it readable - it does not even make sense otherwise.

When you are trying to record "bytes" to a file, and have float32 values raw as a 4 byte sequence, each of those bytes can, of course, have any value from 0-255 - and some of these will be control characters.
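As a rough sketch of the pure-binary route: if every row has the same number of float32 values, the row boundaries follow from arithmetic and no separator is needed. The header layout (two unsigned 32-bit integers for the row and column counts) and the helper names write_matrix and read_row are assumptions made for this example, not part of the question's file format.

```python
import io
import struct

# Hypothetical header: row count and column count as little-endian uint32.
HEADER = struct.Struct('<II')

def write_matrix(f, matrix):
    """Write the header, then every row as fixed-width float32 values."""
    rows, cols = len(matrix), len(matrix[0])
    f.write(HEADER.pack(rows, cols))
    row_format = struct.Struct(f'<{cols}f')
    for row in matrix:
        f.write(row_format.pack(*row))

def read_row(f, index):
    """Seek straight to row `index` and unpack it; no newline is involved."""
    f.seek(0)
    rows, cols = HEADER.unpack(f.read(HEADER.size))
    if not 0 <= index < rows:
        raise IndexError(index)
    row_format = struct.Struct(f'<{cols}f')
    f.seek(HEADER.size + index * row_format.size)
    return row_format.unpack(f.read(row_format.size))

# io.BytesIO stands in for a file opened in binary mode ('wb' / 'rb').
buf = io.BytesIO()
write_matrix(buf, [[1.0, 683.61], [2.5, -4.0]])
first = read_row(buf, 0)
second = read_row(buf, 1)
print(first, second)
```

Because each row sits at a known offset, only the requested row is read into memory, which suits data that does not fit in RAM.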

The alternative is to serialize to a format that encodes your byte values as characters in the printable ASCII range, like base64, JSON, or even pickle using protocol 0.
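For instance, a minimal sketch of the base64 route, one encoded row per line. The helper names write_rows and read_rows are made up for this example, and the little-endian float32 layout is an assumption.

```python
import base64
import io
import struct

def write_rows(f, matrix):
    """Write each row as one base64 line; the encoded text never contains \\n."""
    for row in matrix:
        raw = struct.pack(f'<{len(row)}f', *row)
        f.write(base64.b64encode(raw) + b'\n')

def read_rows(f):
    """Iterate over lines and decode each one back into a tuple of floats."""
    for line in f:
        raw = base64.b64decode(line.strip())
        yield struct.unpack(f'<{len(raw) // 4}f', raw)

# io.BytesIO stands in for a file opened in binary mode.
buf = io.BytesIO()
write_rows(buf, [[1.0, 683.61], [2.5, -4.0]])
buf.seek(0)
rows = list(read_rows(buf))
print(rows)
```

The trade-off is size: base64 emits 4 characters for every 3 bytes of input, so the file grows by roughly a third compared with raw binary.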

Perhaps what will be most comfortable for you is to write your raw bytes to a binary file and change the programs you use to inspect it, for example a hex editor like "hexedit" or Midnight Commander. Both let you browse your bytes by their hexadecimal representation in a comfortable way, and will display any ASCII text sequences found inside the file.

jsbueno

For anyone with the same question as I had, trying to keep a readline-style function working with bytes: the previous answer from @jsbueno got me thinking about alternative ways to proceed rather than modifying the bytes.

Here is an alternative if, like me, you are making your own file with data stored as bytes. Write your own readline() function based on the classic read() function, but with a customized "newline character". Note that the marker must be a byte sequence that cannot occur in the data itself, otherwise the same problem comes back. Here is what I worked out:

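def readline(file, newline=b'Some_byte', size=None):
    """Read from a binary file up to and including a custom newline marker."""
    buffer = bytearray()
    # Read one byte at a time until the marker, the size limit, or EOF.
    while size is None or len(buffer) < size:
        chunk = file.read(1)
        if not chunk:  # end of file: stop instead of looping forever
            break
        buffer += chunk
        if buffer.endswith(newline):
            break
    return buffer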
Gonzalez87