This character - ㎜ - raises a UnicodeEncodeError

Question

I am using a Python script to convert files from gb2312 to utf-8. This character breaks everything: ㎜ (it is a single symbol, not the two letters "mm").

text = '㎜'
text.encode(encoding='gb2312')

raises this error:

UnicodeEncodeError: 'gb2312' codec can't encode character '\u040b' in position 1: illegal multibyte sequence

I can work around it with text.replace('㎜', 'mm'). But what if there are other such characters? What is wrong with it? Why is it so special?

Is there a way to make Python treat it as any other character?

You say you're converting from `gb2312` to `utf-8`, but the code you show converts *from* the internal Python encoding (which supports arbitrary Unicode characters), *to* `gb2312`. This may in fact be your problem, or you might've just shown the wrong part of the code. Please clarify. — zwol, Nov 25 '12 at 16:57
I just simplified it. It is the same error here. `file_old = open('1.php', mode='r', encoding='gb2312') file_new = open('2.php', mode='w', encoding='utf-8') file_new.write(file_old.read())` — Qiao, Nov 25 '12 at 16:59
Digging in a bit more, the problem character is `U+339C SQUARE MM`, which is *not* representable in GB2312 per http://www.fileformat.info/info/charset/GB2312/list.htm . Are you *certain* that your input file is actually encoded in GB2312? And please show us your original script. — zwol, Nov 25 '12 at 17:01
I just noticed that the character in your error message is `U+040B` ( CYRILLIC CAPITAL LETTER TSHE), not `U+339C`. There *are* some Cyrillic letters in GB2312 but that's not one of them. I think we need to see the actual contents of `1.php` -- please upload it somewhere, *unmodified*, and edit a link into your question; *do not* attempt to edit the contents of the file into your question. — zwol, Nov 25 '12 at 17:07
here it is - `http://chengyangxj.com/1.php`. It is just a file with `GB2312` encoding that contains `㎜`. — Qiao, Nov 25 '12 at 17:11
For the record, your original script does appear to do the job you say you want to do. But as a sanity check, if you are on a Linux, OSX, or BSD system, try running the command `iconv -f gb2312 -t utf-8 < 1.php > 2.php` and see what error messages that produces. — zwol, Nov 25 '12 at 17:11
I am on win7-64 now, so I can't check on another OS. I will just hope that it is the only character that raises this error. — Qiao, Nov 25 '12 at 17:14
It is not possible for a file that is actually in GB2312 encoding to contain the character `㎜`, just to be 100% clear. — zwol, Nov 25 '12 at 17:30
@Zack why is it not possible? You can just create an empty file with GB2312 encoding and insert `㎜` into it. Will your computer blow up? ) — Qiao, Nov 25 '12 at 17:38
It's not possible because GB2312 has no code point for `㎜`. You can *label* a file containing any byte sequence you want as "encoded in GB2312", but that does not make the label correct. — zwol, Nov 25 '12 at 17:45
@Qiao: No, it won't blow up. You'll get an error as the file no longer is in gb2312. — Lennart Regebro, Nov 25 '12 at 18:04

zwol · Accepted Answer · 2012-11-25T17:54:24.627

OK, so, I downloaded the file 1.php and ran your original script on it, and I get a different error message:

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 99-100:
  illegal multibyte sequence

The bytes in the file at offsets 99 and 100 are A9 4C in that order. That is neither a valid GB2312 nor a valid UTF-8 encoding of anything. I suspect you may be in the situation of having a whole bunch of files that are supposedly GB2312 but actually in some other encoding. If you need to just bull through all such problems, you can use errors='replace' and mode='rU' (the latter makes Python understand your DOS newlines).

file_old=open('1.php', mode='rU', encoding='gb2312', errors='replace')

This will insert U+FFFD REPLACEMENT CHARACTER in place of anything it can't decode, and continue. This destroys data; first try to figure out what the real encoding of the file is.
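To see the difference between strict and lossy decoding without touching any files, here is a minimal sketch. The sample bytes are an assumption made up for illustration: the two problem bytes A9 4C wrapped in a little ASCII text. How many bytes the codec swallows per replacement may differ between Python versions, so the sketch only checks that a replacement character appears.

```python
# Hypothetical sample: the two problem bytes (A9 4C) surrounded by ASCII text.
raw = b"5" + bytes.fromhex("a94c") + b" wide"

try:
    raw.decode("gb2312")  # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

# errors="replace" substitutes U+FFFD for whatever cannot be decoded and carries on.
lossy = raw.decode("gb2312", errors="replace")
print("replace:", ascii(lossy))
```

The original bytes cannot be recovered from `lossy`, which is why this is a last resort.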

By the way, don't forget to fix up your HTML header when you're done; the preferred form nowadays is

<!doctype html>
<html><head>
  <meta charset="utf-8">

Concise, standards-compliant, and tested to work all the way back to IE6.

EDIT: On further investigation, GB2312 is a character set, not an encoding. There are several possible encodings of it, but only one allows the two-byte sequence A9 4C: in Big5, it corresponds to the character 呶. (I do not know any of the languages that use Chinese characters; does that make more sense in context than ㎜?)

Python and iconv assume that GB2312 is encoded in a different format, EUC-CN, unless specifically told otherwise. If I modify your script to read

file_old=open('1.php', mode='rU', encoding='big5', errors='strict')
file_new=open('2.php', mode='w', encoding='utf-8')
file_new.write(file_old.read())

then it executes without error on the 1.php you provided.
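One way to compare candidate encodings is to decode just the problem bytes with each of them. This is only a sketch: the helper name `probe` and the candidate list are assumptions for illustration, and the list can be extended with any codec name Python knows.

```python
def probe(data, candidates=("gb2312", "big5", "utf-8")):
    """Map each codec name to the decoded text, or None if strict decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)  # strict error handling by default
        except UnicodeDecodeError:
            results[name] = None
    return results


# The two bytes found at offsets 99 and 100 of 1.php.
results = probe(bytes.fromhex("a94c"))
for name, text in results.items():
    print(name, "->", "fails" if text is None else ascii(text))
```

A clean decode only shows that the bytes are legal in that codec. Whether the resulting character is the one the author meant still has to be judged by someone who can read the text.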

EDIT 2: On further further investigation, what web browsers do with <meta charset="gb2312"> is pretend you wrote <meta charset="gbk">. GBK is a superset of GB2312 that does include the ㎜ character. Python, however, treats GB2312 per its original definition. So what you really want in order for your conversion to match the original file is

file_old=open('1.php', mode='rU', encoding='gbk', errors='strict')
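Put together, the whole conversion might look like the sketch below. Several things here are assumptions rather than part of the original script: the function name `convert`, the throwaway demo file, and the choice of `newline=""` to pass line endings through untouched. It also leaves out the `'U'` mode flag, which was removed in Python 3.11.

```python
import os
import tempfile


def convert(src, dst, src_encoding="gbk", dst_encoding="utf-8"):
    """Re-encode a text file; returns the decoded text for inspection."""
    # newline="" disables newline translation, so CRLF line endings survive as-is.
    with open(src, encoding=src_encoding, errors="strict", newline="") as f_in:
        text = f_in.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as f_out:
        f_out.write(text)
    return text


# Demo on a throwaway file containing the problem bytes A9 4C (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "1.php")
    dst = os.path.join(tmp, "2.php")
    with open(src, "wb") as f:
        f.write(b"width: 5" + bytes.fromhex("a94c") + b"\r\n")
    text = convert(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()

print(ascii(text))
```

Keeping `errors="strict"` means any file that is not really GBK fails loudly instead of being converted with silent damage.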

At least `errors='replace'` saves the time of messing with `try`/`except`. The HTML is just incidental here; it only sets the charset so the browser displays the page correctly. — Qiao, Nov 25 '12 at 17:41
big5 is for traditional characters. It can't be used instead of GB2312. `呶` has no connection with `㎜`. — Qiao, Nov 25 '12 at 17:49
We should get `㎜` in the output file; anything else is not right. But it seems that only `㎜` has this problem. All files converted successfully after replacing it with `mm`. — Qiao, Nov 25 '12 at 17:51
`gbk` is solution! Thank you. That is very useful information. — Qiao, Nov 25 '12 at 17:58
