
I am using a Python script to convert files from gb2312 to utf-8. This character breaks everything: ㎜ (it is one symbol, not "mm").

text = '㎜'
text.encode(encoding='gb2312')

raises this error:

UnicodeEncodeError: 'gb2312' codec can't encode character '\u040b' in position 1: illegal multibyte sequence

I can work around it with text.replace('㎜', 'mm'). But what if there are other such characters? What is wrong with it? Why is it so special?

Is there a way to make Python treat it as any other character?

Qiao
  • You say you're converting from `gb2312` to `utf-8`, but the code you show converts *from* the internal Python encoding (which supports arbitrary Unicode characters) *to* `gb2312`. This may in fact be your problem, or you might've just shown the wrong part of the code. Please clarify. – zwol Nov 25 '12 at 16:57
  • I just simplified it. It is the same error here. `file_old = open('1.php', mode='r', encoding='gb2312') file_new = open('2.php', mode='w', encoding='utf-8') file_new.write(file_old.read())` – Qiao Nov 25 '12 at 16:59
  • Digging in a bit more, the problem character is `U+339C SQUARE MM`, which is *not* representable in GB2312 per http://www.fileformat.info/info/charset/GB2312/list.htm . Are you *certain* that your input file is actually encoded in GB2312? And please show us your original script. – zwol Nov 25 '12 at 17:01
  • I just noticed that the character in your error message is `U+040B` ( CYRILLIC CAPITAL LETTER TSHE), not `U+339C`. There *are* some Cyrillic letters in GB2312 but that's not one of them. I think we need to see the actual contents of `1.php` -- please upload it somewhere, *unmodified*, and edit a link into your question; *do not* attempt to edit the contents of the file into your question. – zwol Nov 25 '12 at 17:07
  • here it is - `http://chengyangxj.com/1.php`. It is just a file with `GB2312` encoding that contains `㎜`. – Qiao Nov 25 '12 at 17:11
  • For the record, your original script does appear to do the job you say you want to do. But as a sanity check, if you are on a Linux, OSX, or BSD system, try running the command `iconv -f gb2312 -t utf-8 < 1.php > 2.php` and see what error messages that produces. – zwol Nov 25 '12 at 17:11
  • I am on win7-64 now, can't check on other OSes. I will just hope that it is the only character that raises this error. – Qiao Nov 25 '12 at 17:14
  • It is not possible for a file that is actually in GB2312 encoding to contain the character `㎜`, just to be 100% clear. – zwol Nov 25 '12 at 17:30
  • @Zack why it is not possible? If you can just create empty file with GB2312 encoding and insert `㎜` into it. Will your computer blow up? ) – Qiao Nov 25 '12 at 17:38
  • It's not possible because GB2312 has no code point for `㎜`. You can *label* a file containing any byte sequence you want as "encoded in GB2312", but that does not make the label correct. – zwol Nov 25 '12 at 17:45
  • @Qiao: No, it won't blow up. You'll get an error as the file no longer is in gb2312. – Lennart Regebro Nov 25 '12 at 18:04

1 Answer


OK, so, I downloaded the file 1.php and ran your original script on it, and I get a different error message:

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 99-100:
  illegal multibyte sequence

The bytes in the file at offsets 99 and 100 are A9 4C, in that order. That is not a valid encoding of anything in either GB2312 or UTF-8. I suspect you may have a whole bunch of files that are supposedly GB2312 but are actually in some other encoding. If you need to just bull through all such problems, you can use errors='replace'. Text mode already translates your DOS newlines on input, so plain mode='r' is enough; the old 'U' flag is redundant in Python 3 and was removed in Python 3.11.

file_old=open('1.php', mode='r', encoding='gb2312', errors='replace')

This will insert U+FFFD REPLACEMENT CHARACTER in place of anything it can't decode, and continue. This destroys data; first try to figure out what the real encoding of the file is.
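
Before falling back to errors='replace', it can help to see exactly which bytes fail. The following is a minimal sketch, assuming the data fits in memory; the helper name find_bad_bytes and the sample bytes are illustrative and not part of the original script:

```python
def find_bad_bytes(data, encoding):
    """Return (start, end, offending_bytes) for the first decode error, or None."""
    try:
        data.decode(encoding)  # strict decoding: raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start, exc.end, data[exc.start:exc.end]
    return None


# Illustrative sample: ASCII text followed by the two bytes from the answer (A9 4C).
sample = b"width: 5" + bytes([0xA9, 0x4C])

bad = find_bad_bytes(sample, "gb2312")
if bad is not None:
    start, end, raw = bad
    print(f"cannot decode bytes {start}-{end - 1}: {raw.hex()}")

# errors='replace' keeps going, but the original bytes are not recoverable
# from the decoded text afterwards.
lossy = sample.decode("gb2312", errors="replace")
print("\ufffd" in lossy)
```

Reading the file in binary mode (open('1.php', 'rb').read()) and passing the result to the helper would report the same offsets that the error message shows.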

By the way, don't forget to fix up your HTML header when you're done; the preferred form nowadays is

<!doctype html>
<html><head>
  <meta charset="utf-8">

Concise, standards-compliant, and tested to work all the way back to IE6.

EDIT: On further investigation, GB2312 is a character set, not an encoding. There are several possible encodings of it, but only one allows the two-byte sequence A9 4C: in Big5, it corresponds to the character 呶. (I do not know any of the languages that use Chinese characters; does that make more sense in context than ㎜?)

Python and iconv assume that GB2312 is encoded in a different format, EUC-CN, unless specifically told otherwise. If I modify your script to read

file_old=open('1.php', mode='r', encoding='big5', errors='strict')
file_new=open('2.php', mode='w', encoding='utf-8')
file_new.write(file_old.read())

then it executes without error on the 1.php you provided.
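
A strict decode that succeeds does not prove the codec is the right one; it only shows that the bytes are legal in that codec. As a rough check, the sketch below decodes the two problem bytes with several of Python's built-in codecs and prints what each produces. The candidate list is my own assumption, not something taken from the original script:

```python
problem_bytes = bytes([0xA9, 0x4C])  # the bytes at offsets 99 and 100 of 1.php

# Candidate codecs to compare; this list is an assumption, adjust as needed.
candidates = ["gb2312", "gbk", "gb18030", "big5"]

results = {}
for name in candidates:
    try:
        results[name] = problem_bytes.decode(name)  # strict by default
    except UnicodeDecodeError:
        results[name] = None  # the bytes are not legal in this codec

for name, text in results.items():
    print(name, "->", repr(text) if text is not None else "cannot decode")
```

Several codecs may accept the same bytes yet yield different characters, so the output still has to be judged against what the text is supposed to say.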

EDIT 2: On further further investigation, what web browsers do with <meta charset="gb2312"> is pretend you wrote <meta charset="gbk">. GBK is a superset of GB2312 that does include the character. Python, however, treats GB2312 per its original definition. So what you really want in order for your conversion to match the original file is

file_old=open('1.php', mode='r', encoding='gbk', errors='strict')
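
Putting this together, here is a minimal conversion sketch, assuming the source files really are GBK. The function name convert_to_utf8 and the newline="" choice (which leaves the original line endings untouched) are my assumptions, not part of the original script:

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-encode a text file to UTF-8; raises UnicodeDecodeError on bad input."""
    # newline="" disables newline translation, so CRLF line endings survive as-is.
    with open(src_path, mode="r", encoding=src_encoding,
              errors="strict", newline="") as src:
        text = src.read()
    with open(dst_path, mode="w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Self-contained demo on a temporary file holding "5", the bytes A9 4C, and CRLF.
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "1.php")
    new_path = os.path.join(tmp, "2.php")
    with open(old_path, "wb") as f:
        f.write(b"5" + bytes([0xA9, 0x4C]) + b"\r\n")
    convert_to_utf8(old_path, new_path)
    with open(new_path, "rb") as f:
        converted = f.read()

print(converted.decode("utf-8"))
```

Keeping errors='strict' means a file that is not actually GBK stops the conversion instead of being silently damaged.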
zwol
  • At least `errors='replace'` can save time from messing with `catch except`. html is just random here, only for setting charset for browser showing it right. – Qiao Nov 25 '12 at 17:41
  • big5 is for traditional hieroglyphs. Can't use it instead of GB2312. `呶` has no connection with `㎜`. – Qiao Nov 25 '12 at 17:49
  • We should get `㎜` in the output file; anything else is not right. But it seems that only `㎜` has this problem. All files converted successfully after replacing it with `mm`. – Qiao Nov 25 '12 at 17:51
  • `gbk` is solution! Thank you. That is very useful information. – Qiao Nov 25 '12 at 17:58