The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.
The length you have is most likely a byte count, since that makes the most sense in a binary file. Read that range of bytes in binary mode and decode it afterward. Here's a contrived example that writes a binary file containing a length-prefixed UTF-8 string and then reads it back. The file has a two-byte length followed by the encoded string data, surrounded by 10 bytes of random data on each side.
import os
import struct

string = "我不喜欢你女朋友。你需要一个新的。"

with open('sample.bin','wb') as f:
    f.write(os.urandom(10))  # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2,'big'))  # write a two-byte big-endian length
    f.write(encoded)  # write string
    f.write(os.urandom(10))  # 10 more random bytes

with open('sample.bin','rb') as f:
    print(f.read())  # show the raw data

# Option 1: Seeking to the known offset, read the length, then the string
with open('sample.bin','rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2),'big')
    result = f.read(length).decode()
    print(result)

# Option 2: read the fixed portion as a structure.
with open('sample.bin','rb') as f:
    # read 10 bytes and a big-endian 16-bit value; the last value is the length
    *other,length = struct.unpack('>10bH',f.read(12))
    result = f.read(length).decode()
    print(result)
Output:
b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
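Both options assume the file really contains as many bytes as the length prefix claims. A minimal sketch of a reusable reader that checks for short reads is below. The function name `read_prefixed_string` and its parameters are hypothetical, and the in-memory `io.BytesIO` buffer (with zero bytes as padding) is only a stand-in for a file laid out like `sample.bin`.

```python
import io

def read_prefixed_string(stream, offset, length_size=2, byteorder="big"):
    """Read a length-prefixed UTF-8 string that starts at a byte offset.

    Hypothetical helper: assumes the prefix is an unsigned byte count.
    """
    stream.seek(offset)
    prefix = stream.read(length_size)
    if len(prefix) != length_size:
        raise EOFError("data ended inside the length prefix")
    length = int.from_bytes(prefix, byteorder)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError(f"expected {length} bytes of string data, got {len(payload)}")
    return payload.decode("utf-8")

# Build an in-memory stand-in: 10 padding bytes, 2-byte length, string, 10 padding bytes.
text = "你好，世界"
encoded = text.encode("utf-8")
data = bytes(10) + len(encoded).to_bytes(2, "big") + encoded + bytes(10)

print(read_prefixed_string(io.BytesIO(data), 10))
```

With a real file, the same call would take the object returned by `open('sample.bin','rb')` in place of the `BytesIO` buffer.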
If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:
import codecs

with open('sample.bin','rb') as f:
    f.seek(12)  # skip the 10 random bytes and the 2-byte length
    c = codecs.getreader('utf8')(f)
    print(c.read(1))
Output:
我
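If the stored length turned out to be a character count rather than a byte count, one possible approach is to decode incrementally until enough characters have been produced. The sketch below feeds one byte at a time to an incremental UTF-8 decoder, so it never touches the non-text bytes that follow the string. The name `read_chars` and the sample data are assumptions made for illustration.

```python
import codecs
import io

def read_chars(stream, offset, count):
    """Read up to `count` UTF-8 characters starting at byte `offset`.

    Hypothetical helper: feeds single bytes to an incremental decoder,
    which returns '' until a complete character has been received.
    """
    stream.seek(offset)
    decoder = codecs.getincrementaldecoder("utf-8")()
    result = ""
    while len(result) < count:
        byte = stream.read(1)
        if not byte:
            break  # end of data before `count` characters were found
        result += decoder.decode(byte)
    return result

# 12 padding bytes, then text, then bytes that are not valid UTF-8.
data = bytes(12) + "你好，世界".encode("utf-8") + b"\xff\xfe"

print(read_chars(io.BytesIO(data), 12, 2))  # prints 你好
```

Reading byte by byte is slow for large strings, so this is only worth doing when the byte length is genuinely unknown.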