When you have a problem with non-ASCII characters/bytes, it is rather unhelpful to print them to your console and then copy/paste that into your question. What you see is quite often NOT what you have got. You should use the built-in repr() function [Python 3.x: ascii()] to show your data as unambiguously as possible.
Do this:
python -c "print repr(open('shiftjis.txt', 'rb').read())"
and copy/paste the results into an edit of your question.
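To illustrate why the console can mislead, here is a minimal Python 3 sketch. The helper name describe and the sample values are made up for the illustration; only the built-in ascii() comes from the advice above.

```python
def describe(data):
    """Return an ASCII-only rendering of a bytes or str object."""
    return ascii(data)

# A real comma and U+201A look nearly identical on most consoles,
# but their ascii() renderings cannot be confused.
for s in (',', '\u201a'):
    print(s, describe(s))

# Raw bytes are shown with explicit \x escapes.
print(describe(b'\x82s\x82\x88'))

# For a file, something like this would apply (file name taken from the command above):
# print(describe(open('shiftjis.txt', 'rb').read()))
```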
Reverse-engineering your data while awaiting enlightenment: A Windows code page would have to be a good suspect, with cp1252 the most usual. As @Mark Tolonen has shown, cp1252 almost fits, with one error. Further investigation shows that the other cp125x encodings produce 2, 3, or 5 errors. AFAIK only the cp125x encodings would map something that looks like a comma (actually U+201A SINGLE LOW-9 QUOTATION MARK) to the shift-jis lead byte \x82. I conclude that the offender is cp1252, and that the error is caused by damage in transit.
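One way to check what the byte \x82 turns into under each Windows code page is sketched below in Python 3. The helper decode_byte and the loop over cp1250 to cp1258 are assumptions made for this illustration; they are not the investigation described above, which counted errors over the whole sample.

```python
import unicodedata

def decode_byte(byte_value, encoding):
    """Return the character a single byte maps to, or None if it is unmapped."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None

# Show how each cp125x codec interprets the shift-jis lead byte 0x82.
for n in range(1250, 1259):
    codec = 'cp%d' % n
    ch = decode_byte(0x82, codec)
    name = unicodedata.name(ch, '<no name>') if ch else '<unmapped>'
    print(codec, ascii(ch), name)
```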
Another possibility is that the underlying original encoding is not shift-jis but its superset, Microsoft's cp932 as used on Japanese Windows. However the problematic sequence '\x82@' is not valid in cp932 either. In any case, if the file(s) that you want to process came from a Japanese Windows machine, it would be better to use cp932 than shift-jis.
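The difference between the two codecs can be probed with a small Python 3 sketch. The helper try_decode and the choice of the byte pair \x81\x60 as a second sample are assumptions for this illustration; that pair is one where Python's shift_jis and cp932 codecs are known to give different characters.

```python
def try_decode(data, encoding):
    """Decode strictly; return None instead of raising on invalid input."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

samples = {
    'wave dash vs fullwidth tilde': b'\x81\x60',
    'problem sequence from the question': b'\x82@',
}

for label, raw in samples.items():
    for codec in ('shift_jis', 'cp932'):
        # None means the codec rejected the byte sequence.
        print('%-36s %-10s %s' % (label, codec, ascii(try_decode(raw, codec))))
```

If the data does come from Japanese Windows, decoding with cp932 avoids surprises of this kind, but neither codec accepts '\x82@'.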
It is not obvious from your question and your code what you want to do nor why you want to do it with byte ranges instead of just decoding your data to Unicode. I don't use pyparsing but it seems highly likely that the subranges that you are feeding it are malformed.
Below is an example of how you could tokenise your input using regular expressions. Note that the pyparsing syntax is slightly different (\0xff instead of Python's \xff).
Code (Python 2):
import re, unicodedata
input_bytes = '\x82s\x82\x88\x82\x89\x82\x93@\x82\x89\x82\x93@\x82@\x82\x93\x82\x88\x82\x89\x82\x86\x82\x94[\x82\x8a\x82\x89\x82\x93@\x82\x93\x82\x94\x82\x92\x82\x89\x82\x8e\x82\x87B'
p_ascii = r'[\x00-\x7f]'
p_hw_katakana = r'[\xa1-\xdf]' # half-width Katakana
p_jis208 = r'[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]'
p_bad = r'.' # anything else
kinds = ['jis208', 'ascii', 'hwk', 'bad']
re_matcher = re.compile("(" + ")|(".join([p_jis208, p_ascii, p_hw_katakana, p_bad]) + ")")
for mobj in re_matcher.finditer(input_bytes):
    s = mobj.group()
    us = s.decode('shift-jis', 'replace')
    print ("%-6s %-9s %-10r U+%04X %s"
        % (kinds[mobj.lastindex - 1], mobj.span(), s, ord(us), unicodedata.name(us, '<no name>'))
        )
Output:
jis208 (0, 2)    '\x82s'    U+FF34 FULLWIDTH LATIN CAPITAL LETTER T
jis208 (2, 4)    '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (4, 6)    '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (6, 8)    '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (8, 9)    '@'        U+0040 COMMERCIAL AT
jis208 (9, 11)   '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (11, 13)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (13, 14)  '@'        U+0040 COMMERCIAL AT
jis208 (14, 16)  '\x82@'    U+FFFD REPLACEMENT CHARACTER
jis208 (16, 18)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (18, 20)  '\x82\x88' U+FF48 FULLWIDTH LATIN SMALL LETTER H
jis208 (20, 22)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (22, 24)  '\x82\x86' U+FF46 FULLWIDTH LATIN SMALL LETTER F
jis208 (24, 26)  '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
ascii  (26, 27)  '['        U+005B LEFT SQUARE BRACKET
jis208 (27, 29)  '\x82\x8a' U+FF4A FULLWIDTH LATIN SMALL LETTER J
jis208 (29, 31)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (31, 33)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
ascii  (33, 34)  '@'        U+0040 COMMERCIAL AT
jis208 (34, 36)  '\x82\x93' U+FF53 FULLWIDTH LATIN SMALL LETTER S
jis208 (36, 38)  '\x82\x94' U+FF54 FULLWIDTH LATIN SMALL LETTER T
jis208 (38, 40)  '\x82\x92' U+FF52 FULLWIDTH LATIN SMALL LETTER R
jis208 (40, 42)  '\x82\x89' U+FF49 FULLWIDTH LATIN SMALL LETTER I
jis208 (42, 44)  '\x82\x8e' U+FF4E FULLWIDTH LATIN SMALL LETTER N
jis208 (44, 46)  '\x82\x87' U+FF47 FULLWIDTH LATIN SMALL LETTER G
ascii  (46, 47)  'B'        U+0042 LATIN CAPITAL LETTER B
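For readers on Python 3, a rough port of the same tokeniser might look like the sketch below. The names PATTERNS, MATCHER and tokenise are invented here, and the switch from 'replace' to strict decoding with try/except is an assumption of this sketch: on Python 3 the 'replace' handler may return more than one character for a bad pair, which would break ord().

```python
import re
import unicodedata

# Same byte classes as the Python 2 code, written as bytes patterns.
PATTERNS = [
    ('jis208', rb'[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]'),
    ('ascii', rb'[\x00-\x7f]'),
    ('hwk', rb'[\xa1-\xdf]'),   # half-width Katakana
    ('bad', rb'.'),             # anything else
]
MATCHER = re.compile(b'|'.join(b'(' + p + b')' for _, p in PATTERNS), re.DOTALL)

def tokenise(data):
    """Yield (kind, span, raw_bytes, text); text is None if undecodable."""
    for mobj in MATCHER.finditer(data):
        raw = mobj.group()
        try:
            text = raw.decode('shift_jis')
        except UnicodeDecodeError:
            text = None
        yield PATTERNS[mobj.lastindex - 1][0], mobj.span(), raw, text

for kind, span, raw, text in tokenise(b'\x82s\x82\x88@\x82@'):
    if text is None:
        print(kind, span, raw, '<undecodable>')
    else:
        print(kind, span, raw, 'U+%04X' % ord(text), unicodedata.name(text, '<no name>'))
```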
Note 1: You DON'T need to loop around and join O(N**2) character ranges.
If "jascii" just means "FULLWIDTH LATIN (CAPITAL|SMALL) LETTER [A-Z]", then (a) your net is far too large, and (b) you can do that easily using UNICODE character ranges instead of BYTE ranges (after decoding your data, of course).
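A minimal Python 3 sketch of the Unicode-range approach follows. It assumes that "jascii" means only the fullwidth Latin letters (U+FF21 to U+FF3A and U+FF41 to U+FF5A) and that the data decodes cleanly as cp932; the name FULLWIDTH_LATIN and the sample bytes are chosen for the illustration.

```python
import re

# Fullwidth A-Z and a-z, expressed as Unicode code point ranges.
FULLWIDTH_LATIN = re.compile('[\uff21-\uff3a\uff41-\uff5a]+')

raw = b'\x82s\x82\x88\x82\x89\x82\x93@\x82\x89\x82\x93'
text = raw.decode('cp932')          # decode first, then match on characters
runs = FULLWIDTH_LATIN.findall(text)

print([ascii(run) for run in runs])
```

Two short ranges in one character class replace the whole table of lead-byte/trail-byte combinations.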