
I'm trying to read a docx file in Python 2.7 with this code:

import docx
document = docx.Document('sim_dir_administrativo.docx')
docText = '\n\n'.join([
    paragraph.text.encode('utf-8') for paragraph in document.paragraphs])

And then I'm trying to decode the string inside the file with this code, because I have some special characters (e.g. ã):

print docText.decode("utf-8")

But, I'm getting this error:

    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 494457: character maps to <undefined>

How can I solve this?

user3511563

1 Answer


The print statement can only output characters that your console's encoding can represent. You can find out what that encoding is with sys.stdout.encoding. To print special characters, first encode the unicode string to that encoding.

# -*- coding: utf-8 -*-
import sys

print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')

This code snippet was taken from this Stack Overflow answer.
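A minimal sketch of the same idea wrapped in a helper, in case sys.stdout.encoding is None (which can happen when output is piped). The name encode_for_console is hypothetical, the UTF-8 fallback is an assumption, and cp850 merely stands in for a Windows console code page:

```python
from __future__ import print_function
import sys


def encode_for_console(text, encoding=None):
    """Encode unicode text for the console (hypothetical helper).

    Characters the target encoding cannot represent are replaced with '?'.
    """
    # sys.stdout.encoding can be None, e.g. when output is piped.
    # Falling back to UTF-8 is an assumption, not a rule.
    target = encoding or sys.stdout.encoding or 'utf-8'
    return text.encode(target, 'replace')


# cp850 is used here only as an example console code page (an assumption).
encoded = encode_for_console(u'\u00e3 \u2013 b', encoding='cp850')
# The en dash (u'\u2013') has no cp850 equivalent, so it is replaced.
print(repr(encoded))
```

With errors set to 'replace', unmappable characters degrade to '?' instead of raising UnicodeEncodeError, which is usually acceptable for console output but not for data you intend to store.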

Andrew Johnson
  • Thanks, I had already seen this answer, but when I try the suggestion I get this error: return codecs.charmap_encode(input,errors,encoding_map) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128) – user3511563 Jul 25 '14 at 03:24
  • I can reproduce that error by calling `.encode()` directly on a UTF-8 byte string: Python 2 first decodes the bytes implicitly as ASCII, and that step fails. First convert the bytes to unicode with `.decode("utf-8")`, then convert the result to a local bytestring with `.encode(sys.stdout.encoding, errors='replace')`. – Andrew Johnson Jul 25 '14 at 03:28
  • But my initial error was precisely that I couldn't decode to UTF-8. – user3511563 Jul 25 '14 at 03:34
  • I think the decode from UTF-8 to unicode worked, but then the print failed because your console's encoding can't represent every unicode character (here u'\u2013', an en dash). That is why you need to encode to the local encoding after the decode. – Andrew Johnson Jul 25 '14 at 03:39
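The decode-then-encode sequence described in the comments above can be sketched as follows. This is only an illustration under stated assumptions: utf8_bytes_to_console_bytes is a hypothetical name, and cp850 is an assumed console code page, not necessarily the asker's:

```python
from __future__ import print_function


def utf8_bytes_to_console_bytes(raw, console_encoding):
    """Re-encode UTF-8 bytes for a console (hypothetical helper)."""
    # Step 1: bytes -> unicode text. In Python 2, calling .encode() on the
    # raw bytes instead would decode them implicitly as ASCII and fail.
    text = raw.decode('utf-8')
    # Step 2: unicode text -> bytes in the console's encoding. Characters
    # the code page lacks are replaced with '?'.
    return text.encode(console_encoding, 'replace')


# UTF-8 bytes containing an a-tilde and an en dash, similar to docText.
raw = u'\u00e3 \u2013 fim'.encode('utf-8')
out = utf8_bytes_to_console_bytes(raw, 'cp850')  # cp850 is an assumption
print(repr(out))
```

In the question's code, the paragraphs could also be joined as unicode and encoded once at print time, which avoids the round trip through UTF-8 bytes.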