I want to read a Chinese file with Python code, but I get garbled output.

Following is my code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

with open('1.doc', 'r+') as f:
    text = f.readlines()
    print text

Output:

\x01\x00\x00\xfe\xff\xff\xffy\x01\x00\x00z\x01\x00\x00{\x01\x00\x00|\x01\x00\x00}\x01\x00\x00~\x01\x00\x00\x7f\x01\x00\x00\x80\x01\x00\x00\x81\x01\x00\x00\x82\x01\

I know there must be some encoding or decoding problem here, but I don't know how to figure it out.

ketan
Peter Tsung
  • What were you expecting to get? – 一二三 Oct 06 '15 at 09:33
  • @一二三 the file's content is Chinese. I want it to display Chinese. – Peter Tsung Oct 06 '15 at 09:34
  • If you're opening an MS Word document, you'll have to either convert it manually first or, if you're on Windows, use the COM interface as described at http://stackoverflow.com/a/32049165/69893. – Christian Witts Oct 06 '15 at 09:37
  • @ChristianWitts I have tried converting it to txt, and the output changed to something like this: \\u21516 \\u30340 \\u27835 \\u23398 \\u24577 \\u24230 \\u65292 \\u35848 \\u35848 \\u20320 \\u33719 \\u24471 \\u30340 \\u21551 \\u31034 \\u12290 \\\n' – Peter Tsung Oct 06 '15 at 09:41
  • related: [extracting text from MS word files in python](http://stackoverflow.com/q/125222/4279) – jfs Oct 08 '15 at 08:00
  • @曾锐鸿: update your question or ask a new one: include the code that produces the new output and the output itself (exactly as you see it). – jfs Oct 08 '15 at 08:02
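
The escapes quoted in the comment above look like decimal code points in the style of RTF's `\uN` control word, which would suggest the file was saved as RTF rather than plain text. That is an assumption, since the conversion step is not shown. Under that assumption, a minimal Python 3 sketch for turning such escapes back into characters could look like this (the helper name `decode_decimal_escapes` is made up for illustration):

```python
import re


def decode_decimal_escapes(text):
    """Replace each \\uN (N a decimal number) with the character at code point N.

    Assumes RTF-style escapes: N is a signed 16-bit decimal value, and a single
    space after the number is only a delimiter, so it is dropped.
    """
    def to_char(match):
        code_point = int(match.group(1))
        if code_point < 0:
            # RTF writes values above 32767 as negative numbers.
            code_point += 65536
        return chr(code_point)

    return re.sub(r'\\u(-?\d+) ?', to_char, text)


if __name__ == '__main__':
    # The first two escapes from the comment, with single backslashes.
    print(decode_decimal_escapes('\\u21516 \\u30340 '))
```

This only handles the escapes themselves. A real RTF file also contains control words and groups, so a proper RTF parser is the safer route for whole documents.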

2 Answers

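This has nothing to do with Chinese. This is a Word document, which is a binary file format. You can't just read it with readlines; you need to convert it from that binary format first. A library like python-docx will help, although it reads only the newer .docx format, so a legacy .doc file has to be saved as .docx first.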
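
One way to confirm that a file is a binary Word container rather than text is to look at its first bytes. The sketch below is only an illustration: the function name `sniff_word_format` and the returned labels are made up, and the check relies on the well-known signatures of the OLE2 compound file format (used by legacy .doc) and of ZIP archives (used by .docx).

```python
# Signature of an OLE2 compound file, the container used by legacy .doc files.
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'
# Signature of a ZIP archive; a .docx file is a ZIP archive of XML parts.
ZIP_MAGIC = b'PK\x03\x04'


def sniff_word_format(path):
    """Guess the container format of a file from its first bytes."""
    # Open in binary mode so the bytes are not decoded or altered.
    with open(path, 'rb') as f:
        header = f.read(8)
    if header.startswith(OLE2_MAGIC):
        return 'ole2'      # e.g. a legacy .doc: needs a converter
    if header.startswith(ZIP_MAGIC):
        return 'zip'       # possibly a .docx
    return 'unknown'       # may be plain text in some encoding


if __name__ == '__main__':
    import sys
    if len(sys.argv) > 1:
        print(sniff_word_format(sys.argv[1]))
```

A result of `'ole2'` for `1.doc` would mean no choice of text encoding can fix the output, because the bytes are not text to begin with.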

Daniel Roseman

To display Unicode characters, your system must be configured for it. Check your environment's configuration with sys.getdefaultencoding(); if it does not output utf-8, Chinese will not be displayed. If you are on Windows, read with encoding='cp1252', but check the environment first.
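
The environment check described above could be sketched as follows. This is a rough illustration, not a definitive recipe: the helper names `describe_encodings` and `read_text` are made up, and the default of `utf-8` is an assumption about how the text file was saved. It applies only to genuine text files, not to a binary Word document.

```python
import io
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this environment."""
    return {
        'default': sys.getdefaultencoding(),
        # stdout may lack an encoding attribute when it has been replaced.
        'stdout': getattr(sys.stdout, 'encoding', None),
        'locale': locale.getpreferredencoding(False),
    }


def read_text(path, encoding='utf-8'):
    """Read a text file with an explicit encoding instead of the default."""
    # io.open accepts the encoding argument on both Python 2 and Python 3.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


if __name__ == '__main__':
    for name, value in sorted(describe_encodings().items()):
        print(name, value)
```

If the file was written in a different encoding, such as gbk or utf-16, the same call with that encoding name would be needed; a UnicodeDecodeError is a hint that the guess was wrong.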

LetzerWille