7

Many text encodings have the property that you can go through encoded text backwards and still be able to decode it. ASCII, UTF-8, UTF-16, and UTF-32 all have this property. This lets you do handy things like read the last line of a file without reading all the lines before it, or go backwards a few lines from your current position in a file.

Unfortunately, Python doesn't seem to come with any way to decode a file backwards. You can't read backwards or seek by character count in an encoded file. The decoders in the codecs module support incremental decoding forwards, but not backwards. There doesn't seem to be any "UTF-8-backwards" codec I could feed UTF-8 bytes to in reverse order.

I could probably implement the codec-dependent character boundary synchronization myself, read binary chunks backward, and feed properly-aligned chunks to appropriate decoders from the codecs module, but that sounds like the kind of thing where a non-expert would miss some subtle detail and not notice the output is wrong.
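
For concreteness, here is roughly the kind of thing I mean, reduced to UTF-8 and plain `.decode()` instead of the codecs machinery (an untested sketch; `reverse_utf8_blocks` and its chunk handling are made up for illustration, and this is exactly the sort of code where I'd expect to get some detail wrong):

import io

def reverse_utf8_blocks(path, chunk_size=4096):
    # Yield decoded blocks of text starting from the end of the file.
    # Each block reads normally; the blocks come out in reverse file order.
    with io.open(path, 'rb') as f:
        f.seek(0, io.SEEK_END)
        pos = f.tell()
        carry = b''  # continuation bytes whose lead byte hasn't been read yet
        while pos > 0:
            size = min(chunk_size, pos)
            pos -= size
            f.seek(pos)
            data = f.read(size) + carry
            # A UTF-8 continuation byte satisfies (b & 0xc0) == 0x80; skip past
            # any at the start of the chunk, since their lead byte is earlier
            # in the file and hasn't been read yet.
            i = 0
            while i < len(data) and (ord(data[i:i+1]) & 0xc0) == 0x80:
                i += 1
            carry, data = data[:i], data[i:]
            if data:
                yield data.decode('utf-8')
        if carry:
            raise ValueError('truncated UTF-8 sequence at start of file')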

Is there any simple way to decode text backward in Python with existing tools?


Several people appear to have missed the point that reading the entire file to do this defeats the purpose. While I'm clarifying things, I might as well add that this needs to work for variable-length encodings, too. UTF-8 support is a must.

user2357112
  • Possible duplicate of [Read a file in reverse order using python](http://stackoverflow.com/questions/2301789/read-a-file-in-reverse-order-using-python) – gravity Apr 12 '16 at 19:41
  • @gravity: That reads the entire file. I'm specifically trying not to do that. – user2357112 Apr 12 '16 at 19:42
  • There's a specific community wiki answer there that involves reading in chunks. Please take a look at it at this direct link: http://stackoverflow.com/questions/260273/most-efficient-way-to-search-the-last-x-lines-of-a-file-in-python/260433#260433 – gravity Apr 12 '16 at 19:44
  • @gravity: That doesn't work with Unicode. – user2357112 Apr 12 '16 at 19:47
  • @user2357112 - You've answered your own question: "*implement the codec-dependent character boundary synchronization myself, read binary chunks backward, and feed properly-aligned chunks to appropriate decoders from the codecs module,*" That's going to be the simplest way. – Robᵩ Apr 12 '16 at 19:49
  • @Robᵩ: It definitely looks that way, but I'm hoping there's something I missed. – user2357112 Apr 12 '16 at 19:50
  • 2
    P.s. the UTF-8 boundary test is easy. The first byte of a chunk must not satisfy `(x & 0xc0) == 0x80`. – Robᵩ Apr 12 '16 at 19:52
  • If all you want is to read *lines* from the end of a file without loading the entire file into memory, then the implementation may be simple: [read lines e.g., using `mmap` (replace `b'\n'` for non-ASCII-based encodings and use `.rfind()` for speed)](http://stackoverflow.com/a/6813975/4279) and call `line.decode(encoding)` on each line (a rough sketch of this is below). – jfs Apr 14 '16 at 09:46
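
Following up on that last suggestion, a rough sketch of the `mmap` approach (the helper name is made up for illustration, and it assumes a non-empty UTF-8 file with `\n` line endings):

import mmap

def last_line(path, encoding='utf-8'):
    # Map the file and search backwards for the final newline, so only the
    # tail of the file is ever decoded.
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            end = len(mm)
            if end and mm[end - 1:end] == b'\n':   # ignore a trailing newline
                end -= 1
            start = mm.rfind(b'\n', 0, end) + 1    # becomes 0 if no newline is found
            return mm[start:end].decode(encoding)
        finally:
            mm.close()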

1 Answer

5

Absent a general-purpose solution, here is one specific to UTF-8:

def rdecode(it):
    # `it` iterates over the bytes of UTF-8 text in reverse order;
    # decoded characters are yielded as they become complete.
    buffer = []
    for ch in it:
        och = ord(ch)
        if not (och & 0x80):
            # 0xxxxxxx: plain ASCII, a complete character on its own.
            yield ch.decode('utf-8')
        elif not (och & 0x40):
            # 10xxxxxx: continuation byte; hold it until the lead byte arrives.
            buffer.append(ch)
        else:
            # 11xxxxxx: lead byte; restore forward order and decode.
            buffer.append(ch)
            yield ''.join(reversed(buffer)).decode('utf-8')
            buffer = []

# A UTF-8-encoded byte string (Python 2 str): "ho mathētēs hon ēgapā ho Iēsous"
utf8 = 'ho math\xc4\x93t\xc4\x93s hon \xc4\x93gap\xc4\x81 ho I\xc4\x93sous'
print utf8.decode('utf8')
for i in rdecode(reversed(utf8)):
    print i,
print ""

Result:

$ python x.py 
ho mathētēs hon ēgapā ho Iēsous
s u o s ē I   o h   ā p a g ē   n o h   s ē t ē h t a m   o h 
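
In Python 3, iterating over a bytes object yields ints, so a rough equivalent of the same generator (a lightly adapted sketch of the code above, nothing more) could look like this:

def rdecode3(byte_iter):
    # byte_iter yields ints, e.g. reversed(some_bytes).
    buf = bytearray()
    for b in byte_iter:
        if not (b & 0x80):          # 0xxxxxxx: ASCII, a complete character
            yield chr(b)
        elif not (b & 0x40):        # 10xxxxxx: continuation byte, hold it
            buf.append(b)
        else:                       # 11xxxxxx: lead byte completes the character
            buf.append(b)
            yield bytes(reversed(buf)).decode('utf-8')
            buf = bytearray()

utf8 = 'ho mathētēs hon ēgapā ho Iēsous'.encode('utf-8')
print(' '.join(rdecode3(reversed(utf8))))
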
Robᵩ
  • That looks like what I was thinking of for the "implement it myself" case, although it doesn't have any of the chunking optimization you'd want for operating on real files. I guess a lot of the work I didn't want to deal with was really in multiple codec support and writing a convenient, efficient file object that supports `read`-ing forwards and backwards and backwards iteration; for just UTF-8, the decoding itself isn't too bad. – user2357112 Apr 12 '16 at 20:25