
f.read(1) will return 1 byte, not one character. The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string. There is no newline character at the end of the string. How do I read such strings?

I have seen this question but none of the answers address the UTF-8 case.

Example code:

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x41')
    f.write(b'\xD0')
    f.write(b'\xB1')
    f.write(b'\xC0')

with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.buffer.read(1), '+', f.read(1))

This outputs:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte

When f.write(b'\xC0') is removed, it works as expected. It seems to read more than it is told: the code doesn't say to read the 0xC0 byte.

Sergey Slepov
  • can you add example of the file? – Ido May 18 '21 at 16:02
  • @Ido Any UTF-8 encoded file will do. I just need a way to read one character. Once I know how to do that, I can put it in a loop and read the whole string :) – Sergey Slepov May 18 '21 at 16:08
  • Read one byte, if it’s over ord 127, keep reading until you encounter a byte in the 127 range or over 192… – deceze May 18 '21 at 16:13
  • The fundamental problem is that "1 character" is really tricky to pin down to a simple technical definition. Do you mean one Unicode codepoint? Or one grapheme cluster? Or something else? I know that "but why do you need it" is an annoying question, but in this case understanding the *reason* for wanting to do this can make it much easier to get to a sensible solution. This might be an [XY problem](https://xyproblem.info). – Joachim Sauer May 18 '21 at 16:16
  • In a nutshell: for *real* Unicode text handling you should let go of the notion of manually splitting the text into some units (unless you are writing a library dedicated to that task), because the units are **way** more complicated than many of us assume. Splitting words is way more complicated than in a boring old ASCII world and so is defining "character". Your best bet is to handle "big" chunks (such as whole lines or at least strings of some size) and let the Unicode-enabled libraries do their thing. – Joachim Sauer May 18 '21 at 16:23
  • @JoachimSauer, it's Unicode codepoints that I'm after. The reason I am doing this is I want to be able to read this file format. I *could* change the file format (because I wrote it) but I don't want to do it (yet) just because Python can't read it. – Sergey Slepov May 19 '21 at 07:36
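
The byte-by-byte approach suggested in the comments can be sketched as follows. This is only an illustration, not code from the thread: `read_codepoint` is a hypothetical helper name, and it assumes the stream is positioned at the start of a UTF-8 sequence. It uses the lead byte to decide how many continuation bytes to read, so it never consumes bytes beyond the current code point.

```python
import io


def read_codepoint(stream):
    """Read exactly one UTF-8 encoded code point from a binary stream.

    Hypothetical helper: returns '' at end of file and lets decode()
    raise UnicodeDecodeError on malformed input.
    """
    first = stream.read(1)
    if not first:
        return ''
    lead = first[0]
    if lead < 0x80:            # 0xxxxxxx: single-byte (ASCII) code point
        extra = 0
    elif 0xC0 <= lead < 0xE0:  # 110xxxxx: one continuation byte follows
        extra = 1
    elif 0xE0 <= lead < 0xF0:  # 1110xxxx: two continuation bytes follow
        extra = 2
    elif 0xF0 <= lead < 0xF8:  # 11110xxx: three continuation bytes follow
        extra = 3
    else:                      # stray continuation or invalid lead byte
        extra = 0              # decode() below will raise
    return (first + stream.read(extra)).decode('utf-8')


# Same bytes as in the question; io.BytesIO stands in for the file.
data = io.BytesIO(b'\x41\xd0\xb1\xc0')
print(read_codepoint(data))  # A
print(read_codepoint(data))  # б
print(data.tell())           # 3 (the trailing 0xC0 byte has not been read)
```

Because only the bytes belonging to the requested code point are read, the invalid 0xC0 byte at the end is never touched unless the caller asks for another code point.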

2 Answers


Here's a character that takes up more than one byte. In binary mode, read(1) returns only its first byte. In text mode, whether or not you pass the UTF-8 encoding explicitly (provided UTF-8 is the default on your system), read(1) returns the whole character.

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')
    
with open(file, 'rb') as f:
    print(f.read(1))
with open(file, 'r') as f:
    print(f.read(1))

Output:

b'\xe2'
⾀

Even though part of the file is not UTF-8, you can still open it in text (non-binary) mode, skip to the byte where the character starts and then read a whole character by running read(1).

This works even if your character isn't in the beginning of the file:

file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x01')
    f.write('⾀'.encode('utf-8'))

    
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.read(1),'+', f.read(1))

If this does not work for you please provide an example.

rudolfovic
  • This works, but uses "codepoint" to mean "character". Some characters are made up of multiple codepoints. For example `ë` is two unicode codepoints U+0065 U+0308 (yes, it can be represented as a single codepoint `ë` U+00EB, but that's not required and isn't true for all possible combined characters). Emojis are a more common example of "characters" that are often made up of multiple codepoints. – Joachim Sauer May 18 '21 at 16:28
  • Sure, though, this isn't something that the OP specified and these 2 codepoints can still be rendered separately: ¨e – rudolfovic May 18 '21 at 16:50
  • Thanks! I have updated my question to include an example that doesn't work. – Sergey Slepov May 18 '21 at 18:03
  • This also assumes `open(file, 'r')` defaults to UTF-8 encoding, which isn't true. It depends on the terminal configuration and is the value of `locale.getpreferredencoding(False)`. – Mark Tolonen May 19 '21 at 04:24
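
Following up on the last comment, here is a minimal sketch that names the codec explicitly instead of relying on the locale default. The temporary path is only there to keep the example self-contained; it is not part of the answer above.

```python
import os
import tempfile

# Write one multi-byte character followed by an arbitrary byte.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')
with open(path, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')

# Pass encoding='utf-8' so the result does not depend on the platform default.
with open(path, 'r', encoding='utf-8') as f:
    first = f.read(1)  # one code point, three bytes on disk

print(first)
```

Note that, as the example in the question shows, the text layer may decode further ahead than the single character requested, so invalid bytes later in the file can still raise `UnicodeDecodeError`.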

The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.

You have the length of the string, which is likely the byte length as it makes the most sense in a binary file. Read the range of bytes in binary mode and decode it after-the-fact. Here's a contrived example of writing a binary file with a UTF-8 string with the length encoded first. It has a two-byte length followed by the encoded string data, surrounded with 10 bytes of random data on each side.

import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin','wb') as f:
    f.write(os.urandom(10))  # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2,'big')) # write a two-byte big-endian length
    f.write(encoded)                        # write string
    f.write(os.urandom(10))                 # 10 more random bytes

with open('sample.bin','rb') as f:
    print(f.read())  # show the raw data

# Option 1: Seeking to the known offset, read the length, then the string
with open('sample.bin','rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2),'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure.
with open('sample.bin','rb') as f:
    # read 10 bytes and a big-endian 16-bit length
    *other, length = struct.unpack('>10bH', f.read(12))
    result = f.read(length).decode()
    print(result)

Output:

b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
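
If the file holds several such strings, Option 1 could be wrapped in a small function. The sketch below is an assumption-laden illustration rather than part of the format described in the question: `read_prefixed_string` is a hypothetical name, the two-byte big-endian prefix mirrors the sample above, and the short-read checks are an added safeguard.

```python
import io


def read_prefixed_string(stream, prefix_size=2, byteorder='big'):
    """Hypothetical helper: read a byte-length prefix, then that many UTF-8 bytes."""
    prefix = stream.read(prefix_size)
    if len(prefix) != prefix_size:
        raise EOFError('truncated length prefix')
    length = int.from_bytes(prefix, byteorder)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError('truncated string data')
    return payload.decode('utf-8')


# Build an in-memory stand-in for the file: 10 filler bytes, prefix, string, filler.
text = '你需要一个新的。'
encoded = text.encode('utf-8')
blob = b'\x00' * 10 + len(encoded).to_bytes(2, 'big') + encoded + b'\xff' * 10

stream = io.BytesIO(blob)
stream.seek(10)                      # known offset of the length prefix
print(read_prefixed_string(stream))  # the decoded string
print(stream.tell() == 12 + len(encoded))  # True: positioned right after the string
```

Decoding the whole byte range at once avoids having to find character boundaries, and the stream is left positioned at the first byte after the string.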

If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:

import codecs

with open('sample.bin','rb') as f:
    f.seek(12)
    c = codecs.getreader('utf8')(f)
    print(c.read(1))

Output:

我

Mark Tolonen