
I am porting some code from Python 2.7 to 3.4.2 and I am stuck on the bytes vs. string complication.

I read this 3rd point in wolf's answer:

Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in binary mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;

So, when I buffer-read a file (say, 1 byte each time) and the very first character happens to be a 6-byte Unicode one, how do I figure out how many more bytes need to be read? Because if I do not read the complete character, it will be skipped from processing, as the next read(x) will read x bytes relative to its last position (i.e. from halfway through the character's byte sequence).
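For context (not part of the original question): the total length of a UTF-8 sequence is recoverable from its lead byte alone, so one way to know "how many more bytes" is to inspect the first byte before reading the rest. A minimal sketch, assuming standard 1-to-4-byte UTF-8:

```python
def utf8_seq_len(first_byte):
    """Return the total byte length of the UTF-8 sequence starting here."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
        return 4
    raise ValueError('continuation byte or invalid lead byte')

print(utf8_seq_len('é'.encode('utf-8')[0]))  # 2
```

(In Python 3, indexing a `bytes` object yields an `int`, which is why the lead byte can be bit-shifted directly.)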

I tried the following approach:

import sys, os

def getBlocks(inputFile, chunk_size=1024):
    while True:
        try:
            data=inputFile.read(chunk_size)
            if data:
                yield data
            else:
                break
        except IOError as strerror:
            print(strerror)
            break

def isValid(someletter):
    try:
        someletter.decode('utf-8', 'strict')
        return True
    except UnicodeDecodeError:
        return False

def main(src):
    aLetter = bytearray()
    with open(src, 'rb') as f:
        for aBlock in getBlocks(f, 1):
            aLetter.extend(aBlock)
            if isValid(aLetter):
                # print("char is now a valid one") # just for acknowledgement
                # do more with the complete character, then reset the buffer
                aLetter.clear()
            # else: keep looping; the next 1-byte block extends aLetter further

Questions:

  1. Am I doomed if I try fileHandle.seek(-ve_value_here, 1)?
  2. Python must have something built-in to deal with this; what is it?
  3. How can I really test that the program meets its purpose of ensuring complete characters are read? (Right now I only have simple English files.)
  4. How can I determine the best chunk_size to make the program faster? I mean, reading 1024 bytes where the first 1023 bytes are 1-byte-representable chars and the last is a 6-byter leaves me with the only option of reading 1 byte each time.

Note: I can't rely on a fixed buffered read, as I do not know the range of input file sizes in advance.
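Regarding question 1, a quick self-contained check (the file name `sample.bin` is made up for illustration) suggests that negative relative seeks are fine on files opened in binary mode; it is text-mode files where `seek` with `whence=1` is restricted:

```python
# Create a small binary file, then seek backwards relative to the
# current position and re-read from there.
with open('sample.bin', 'wb') as f:
    f.write(b'hello')

with open('sample.bin', 'rb') as f:
    f.read(4)           # position is now 4
    f.seek(-2, 1)       # negative relative seek: position is now 2
    print(f.read())     # b'llo'
```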

RinkyPinku

1 Answer


The answer to #2 will solve most of your issues. Use an IncrementalDecoder via codecs.getincrementaldecoder. The decoder maintains state and only outputs fully decoded sequences:

#!python3
import codecs
byte_string = '\u5000\u5001\u5002'.encode('utf8')

# Get the UTF-8 incremental decoder.
decoder_factory = codecs.getincrementaldecoder('utf8')
decoder_instance = decoder_factory()

# Simple example, read two bytes at a time from the byte string.
result = ''
for i in range(0,len(byte_string),2):
    chunk = byte_string[i:i+2]
    result += decoder_instance.decode(chunk)
    print('chunk={} state={} result={}'.format(chunk,decoder_instance.getstate(),ascii(result)))
result += decoder_instance.decode(b'',final=True)
print(ascii(result))

Output:

chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result=''
chunk=b'\x80\xe5' state=(b'\xe5', 0) result='\u5000'
chunk=b'\x80\x81' state=(b'', 0) result='\u5000\u5001'
chunk=b'\xe5\x80' state=(b'\xe5\x80', 0) result='\u5000\u5001'
chunk=b'\x82' state=(b'', 0) result='\u5000\u5001\u5002'
'\u5000\u5001\u5002'

Note that after the first two bytes are processed, the decoder just buffers them in its internal state and appends no characters to the result. The next two complete a character and leave one byte in the internal state. The last call with no additional data and final=True flushes the buffer; it will raise an exception if there is an incomplete character pending.

Now you can read your file in whatever chunk size you want, pass them all through the decoder and be sure that you only have complete code points.
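As a sketch of that pattern (the helper name and generator shape are my own, not from the answer above):

```python
import codecs

def read_decoded(path, chunk_size=1024):
    """Yield decoded text from a UTF-8 file, read in arbitrary byte chunks.

    The incremental decoder buffers any partial multi-byte character
    between chunks, so only complete code points are ever yielded.
    """
    decoder = codecs.getincrementaldecoder('utf8')()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = decoder.decode(chunk)
            if text:
                yield text
    # Flush; raises UnicodeDecodeError if the file ends mid-character.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail
```

Even with chunk_size=1, the joined output equals the file's full decoded text, because bytes that end mid-character simply produce an empty string until the character completes.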

Note that with Python 3 you can simply open the file in text mode and declare the encoding. The chunks you read will then be complete Unicode code points, because the text layer uses an IncrementalDecoder internally:

input.txt (saved in UTF-8 without BOM)

我是美国人。
Normal text.

code

with open('input.txt',encoding='utf8') as f:
    while True:
        data = f.read(2)   # reads 2 Unicode codepoints, not bytes.
        if not data: break
        print(ascii(data))

Result:

'\u6211\u662f'
'\u7f8e\u56fd'
'\u4eba\u3002'
'\nN'
'or'
'ma'
'l '
'te'
'xt'
'.'
Mark Tolonen
  • If I copy-paste-run it from a .py file it doesn't run; in terminal it runs, *sort-of*, because I am on windows & those chinese chars show up as ?????? in my terminal. Should I read file in text mode? – RinkyPinku Jan 15 '15 at 05:05
  • I tried reading as text & binary, please see [this](http://ideone.com/uESA6R). If byte_string *is in code* or *in external file read as text* both times I get a `UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to ` – RinkyPinku Jan 15 '15 at 05:14
  • Tried [this](http://ideone.com/MZkslt) too . . . same `UnicodeEncodeError` if I process the Chinese file – RinkyPinku Jan 15 '15 at 06:45
  • Use an IDE or terminal that supports UTF-8. I'm on Windows as well and used PythonWin editor from the pywin32 extension. – Mark Tolonen Jan 15 '15 at 08:28
  • To work in the console you could also use `print(ascii(result))` which will display Unicode escape codes for non-ASCII characters: `'\u6211\u662f\u7f8e\u56fd\u4eba\u3002'`. Use it in place of `repr(result)` in the format string also. – Mark Tolonen Jan 15 '15 at 08:31
  • I'm also on Windows and get `TypeError: non-empty format string passed to object.__format__` on the line `print('chunk={:13} state={:17} result={}'.format(chunk,decoder_instance.getstate(),repr(result)))`. – martineau Jan 15 '15 at 08:44
  • @martineau I also got the same error, I just commented it out as it wasn't really contributing other than giving regular output. – RinkyPinku Jan 15 '15 at 08:50
  • @martineau. I got that at the terminal too. Not sure what that error means. I just updated the script to use Unicode escapes so it keeps everything ASCII so a terminal can run it. – Mark Tolonen Jan 15 '15 at 08:54
  • @MarkTolonen Thanks sir; we don't have very good internet in my city, so I may not be able to download & test. But the issue still remains, i.e. if I try to fetch data from file – RinkyPinku Jan 15 '15 at 09:02
  • With the exact code above now? It doesn't print non-ASCII characters now. Here's a run at [ideone.com](http://ideone.com/8jVJ6x). – Mark Tolonen Jan 15 '15 at 09:07
  • No, sorry, I missed that you changed a number of other things, too. It looks like it's working now. – martineau Jan 15 '15 at 09:09
  • That `TypeError` seems like a bug. The second parameter to `format` is a `tuple`, but when I add format beyond simple `{}` such as `{:12}` to give it a field length the error occurs, but only in the console. My IDE works fine. – Mark Tolonen Jan 15 '15 at 09:11
  • Yes, I agree, the `TypeError` for `non-empty format string` for just a field width definitely seems bogus. Make sure your console font is set to a TrueType font like Lucida Console (see [this](http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using) question). – martineau Jan 15 '15 at 09:14
  • FWIW, see [_Change default code page of Windows console to UTF-8_](http://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8). – martineau Jan 15 '15 at 09:25
  • Yeah, with the exact code above, it works now. But fails if I save file with that string as utf-8 & the `open()` the file in `'rt', encoding='utf-8'`, I can't take input from terminal. I am now not getting that `TypeError` though – RinkyPinku Jan 15 '15 at 09:26
  • To read a file, use `rb` with no encoding. That will read raw bytes from the file to feed the decoder. Opening with `'rt', encoding='utf8'` will translate the raw bytes to Unicode strings. It actually uses the incremental decoder internally, so you wouldn't have to worry about partial characters that way :) – Mark Tolonen Jan 15 '15 at 09:29
  • I added an example. Using `open` with an encoding is probably exactly what you really need. – Mark Tolonen Jan 15 '15 at 09:40
  • Thanks. All this because the Python docs said read()'s argument is always bytes, but your comment above clarifies it all :) – RinkyPinku Jan 15 '15 at 12:52