
I have a UTF-8 encoded subtitle file containing Chinese characters. In fact, it's tiny so here is the file.

So far I've managed to read the file using

with open(path) as f:
    text = f.read().decode('utf-8-sig').encode('utf-8')
    print text[:100]

All I get is the usual mis-encoding mess:

1
00:00:20,160 --> 00:00:22,660
派拉蒙电影公å¸

2
00:00:32,160 --> 00:00:36,660
åŽçº³å…„弟ç

I've set `chcp 65001` in cmd.exe and then ran the Python script. What am I doing wrong?
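One way to narrow this down is to separate the decoding step from the printing step. The sketch below assumes Python 3 (the snippet above is Python 2); `read_subtitles` and `describe` are hypothetical helper names, not part of any library. It decodes the file to text once, stripping the BOM, and then produces an ASCII-only view of the first characters, so the check does not depend on what the console can display.

```python
import io


def read_subtitles(path):
    """Return the file contents as text, dropping a leading UTF-8 BOM if present."""
    # 'utf-8-sig' decodes UTF-8 and silently removes the BOM when there is one.
    with io.open(path, encoding="utf-8-sig") as f:
        return f.read()


def describe(text, limit=100):
    """Return an ASCII-only view of the first `limit` characters."""
    # Non-ASCII characters become \uXXXX escapes, which any console can show.
    return text[:limit].encode("unicode_escape").decode("ascii")
```

If `describe(read_subtitles(path))` shows escapes such as `\u6d3e`, the file was decoded correctly and the remaining problem is in how the console renders the output.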

Martijn Pieters
Lucidnonsense
  • You are printing UTF-8 encoded bytes to a console that is expecting CP-1252 characters, by the looks of it. – Martijn Pieters May 02 '15 at 12:46
  • The first string started with `派拉蒙电影公` but is incomplete (not all CP1252 codepoints are printable). The other contains `纳兄弟` but is again incomplete due to lost codepoints. – Martijn Pieters May 02 '15 at 12:49
  • Using the `ftfy` package's sloppy CP1252 codec to produce the same bytes I get `u'\xe6\xb4\xbe\xe6\u2039\u2030\xe8\u2019\u2122\xe7\u201d\xb5\xe5\xbd\xb1\xe5\u2026\xac\xe5\x8f\xb8'` and `u'\xe5\x8d\u017d\xe7\xba\xb3\xe5\u2026\u201e\xe5\xbc\u0178\xe7\u201d\xb5\xe5\xbd\xb1\xe5\u2026\xac\xe5\x8f\xb8'` respectively, which happen to match your output exactly. – Martijn Pieters May 02 '15 at 12:55
  • Since this is not a Python issue, I duped you to the canonical *output UTF-8 to Windows Console* question, since that is the only thing we can tell you to do. – Martijn Pieters May 02 '15 at 12:57
  • I don't understand what is happening. What are these lost codepoints? What does the encoding with ftfy mean? And I had previously looked at that page? Surely, there is a way to display it? – Lucidnonsense May 02 '15 at 13:02
  • Not all bytes in the UTF-8 encoded result map to CP-1252 characters, so those are lost when printing. I used the original file to reproduce what the console was trying to display. `ftfy` is a Python library to deal with these kinds of mis-applied codecs, you don't need to use it here. Your console is still not configured correctly, you need to triple-check you have the right configuration, Python experts cannot help you with that problem. – Martijn Pieters May 02 '15 at 13:04
  • I've changed the console's map to 65001 though, so shouldn't it display? I'm still using the lucida font. Nothing is working. – Lucidnonsense May 02 '15 at 13:08
  • @Lucidnonsense, please do not follow the advice of the linked answer. Suggesting codepage 65001 is really misinformed advice. To use Unicode in the Windows console, consider switching to Python 3 and using [win-unicode-console](https://github.com/Drekin/win-unicode-console). – Eryk Sun May 02 '15 at 13:22
  • @eryksun: The `win-unicode-console` package fixes specific edge cases; under Python 2 setting the codepage can be made to work. There are apparently some [Python 2 workarounds possible](http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/3259271). Yes, the Windows console and UTF-8 is a black morass of problems. – Martijn Pieters May 02 '15 at 13:46
  • @MartijnPieters, CP65001 is broken. I've attached a debugger to conhost.exe and watched it fail. When encoding and decoding it assumes the string is ANSI, with one byte per character, and it doesn't check for failure. This leads to buggy output since it misreports the number of bytes written, and a far worse problem with input. When encoding to multibyte UTF-8 on the server side, `WideCharToMultiByte` fails because the buffer is too small, but `conhost!SrvReadConsole` blindly returns that it successfully read 0 bytes to the client side (`ReadConsoleA`). That's `EOF`, so the REPL just exits. – Eryk Sun May 02 '15 at 14:22
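
The mis-decoding described in the comments can be reproduced without a Windows console. The following is a minimal sketch, assuming Python 3; `as_seen_on_cp1252` is a hypothetical helper name, and `errors="replace"` only stands in for whatever the console actually does with bytes that have no CP-1252 mapping. The sample string is the first subtitle line as recovered from the escaped bytes quoted above.

```python
def as_seen_on_cp1252(text):
    """Encode text as UTF-8, then decode those bytes as if they were CP-1252."""
    raw = text.encode("utf-8")
    # CP-1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so they
    # cannot be mapped to a character; here they become U+FFFD.
    return raw.decode("cp1252", errors="replace")


original = u"派拉蒙电影公司"
garbled = as_seen_on_cp1252(original)

# ascii() keeps the output printable on any console.
print(ascii(garbled))
print("unmapped bytes:", garbled.count(u"\ufffd"))
```

Each Chinese character becomes three unrelated Latin-range characters, and the byte `0x8f` inside the last character has no CP-1252 mapping, which is consistent with the truncated ending of the first subtitle line in the question.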

0 Answers