
I'm trying to replace a substring in a Word file using the following command sequence in Python. The code on its own works perfectly fine - even with the exact same Word file - but when I embed it in a larger project structure, it throws an error at exactly that spot. I'm clueless as to what causes it, as it seemingly has nothing to do with the code and seems unreproducible for me.

Side note: I know what's causing the error: it's a German 'ü' in the Word file. But it's needed, and removing it doesn't seem like the right solution if the code works standalone.

#foo.py
from bar import make_wordm
def main(uuid):
    with open('foo.docm', 'w+') as f:
        f.write(make_wordm(uuid=uuid))

main('1cb02f34-b331-4616-8d20-aa1821ef0fbd')

foo.py imports bar.py, which does the heavy lifting.

#bar.py
import tempfile
import shutil
from cStringIO import StringIO
from zipfile import ZipFile, ZipInfo

WORDM_TEMPLATE='./res/template.docm'
MODE_DIRECTORY = 0x10

def zipinfo_contents_replace(zipfile=None, zipinfo=None,
                             search=None, replace=None):
    dirname = tempfile.mkdtemp()
    fname = zipfile.extract(zipinfo, dirname)
    with open(fname, 'r') as fd:
        contents = fd.read().replace(search, replace)
    shutil.rmtree(dirname)
    return contents

def make_wordm(uuid=None, template=WORDM_TEMPLATE):
    with open(template, 'r') as f:
        input_buf = StringIO(f.read())
    output_buf = StringIO()
    output_zip = ZipFile(output_buf, 'w')

    with ZipFile(input_buf, 'r') as doc:
        for entry in doc.filelist:
            if entry.external_attr & MODE_DIRECTORY:
                continue

            contents = zipinfo_contents_replace(
                zipfile=doc, zipinfo=entry,
                search="00000000-0000-0000-0000-000000000000",
                replace=uuid)
            output_zip.writestr(entry, contents)
    output_zip.close()
    return output_buf.getvalue()

The following error is thrown when embedding the same code in a larger scale context:

ERROR:root:message
Traceback (most recent call last):
  File "FooBar.py", line 402, in foo_bar
    bar = bar_constructor(bar_theme,bar_user,uuid)
  File "FooBar.py", line 187, in bar_constructor
    if(main(uuid)):
  File "FooBar.py", line 158, in main
    f.write(make_wordm(uuid=uuid))
  File "/home/foo/FooBarGen.py", line 57, in make_wordm
    search="00000000-0000-0000-0000-000000000000", replace=uuid)
  File "/home/foo/FooBarGen.py", line 24, in zipinfo_contents_replace
    contents = fd.read().replace(search, replace)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128)
INFO:FooBar:None

edit: Upon further examination and debugging, it seems like the variable 'uuid' is causing the issue. When passing the parameter as a literal string ('1cb02f34-b331-4616-8d20-aa1821ef0fbd') instead of the variable parsed from JSON, it works perfectly fine.

edit2: I had to add `uuid = uuid.encode('utf-8', 'ignore')` and it works perfectly fine now.
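For context, roughly what that edit2 fix does (the JSON payload here is made up for illustration): `json.loads` yields a text (Unicode) string, and encoding it to UTF-8 bytes keeps the later `replace` call purely byte-string based, so no implicit ascii decode is ever attempted:

```python
import json

# Hypothetical JSON payload; in the real project the uuid comes from elsewhere.
data = json.loads('{"uuid": "1cb02f34-b331-4616-8d20-aa1821ef0fbd"}')

# json.loads returns a text (Unicode) string; encode it to UTF-8 bytes
# so that the .replace() below operates on byte strings only.
uuid = data["uuid"].encode("utf-8", "ignore")

# UTF-8 encoded file contents containing an 'ü' (0xc3 0xbc).
contents = b"id=00000000-0000-0000-0000-000000000000 \xc3\xbc"
result = contents.replace(b"00000000-0000-0000-0000-000000000000", uuid)
```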

  • On your `open`, you do not specify the encoding. If it works sometimes, that does not mean it is correct. Python mantra: "explicit is better than implicit". Note: a large framework could activate "locales", which could change the behaviour of locale-dependent parts (and possibly also translate strings). – Giacomo Catenazzi Apr 20 '18 at 11:41

3 Answers


The problem is mixing Unicode and byte strings. Python 2 "helpfully" tries to convert from one to the other but defaults to using the ascii codec.

Here's an example:

>>> 'aeioü'.replace('a','b')  # all byte strings
'beio\xfc'
>>> 'aeioü'.replace(u'a','b') # one Unicode string and it converts...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 4: ordinal not in range(128)

You mentioned reading a UUID from JSON. JSON returns Unicode strings. Ideally read all text files decoding to Unicode, do all text processing in Unicode, and encode text files when writing back to storage. In your "larger framework" this could be a big porting job, but essentially use io.open with an encoding to read a file and decode to Unicode:

with io.open(fname, 'r', encoding='utf8') as fd:
    contents = fd.read().replace(search, replace)

Note that encoding should match the actual encoding of the files you are reading. That's something you'll have to determine.
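As a self-contained illustration of that pattern (file name and path are made up), writing a UTF-8 file and reading it back with an explicit encoding, so `read()` returns Unicode text:

```python
import io
import os
import tempfile

# Create a small UTF-8 encoded file containing a non-ASCII character.
fname = os.path.join(tempfile.mkdtemp(), "demo.txt")
with io.open(fname, "w", encoding="utf8") as fd:
    fd.write(u"id=00000000-0000-0000-0000-000000000000 \u00fc")

# Read it back with an explicit encoding; .read() now yields Unicode text,
# so replacing with a Unicode search/replace pair is safe.
with io.open(fname, "r", encoding="utf8") as fd:
    contents = fd.read().replace(u"00000000-0000-0000-0000-000000000000",
                                 u"1cb02f34-b331-4616-8d20-aa1821ef0fbd")
```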

A shortcut, as you've found in your edit, is to encode the UUID from JSON back to a byte string, but using Unicode to deal with text should be the goal.

Python 3 cleans up this process by making strings Unicode by default, and drops the implicit conversion to/from byte/Unicode strings.
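For comparison, a small sketch of the Python 3 behaviour: mixing the two types fails immediately with a `TypeError` at the call site instead of a surprise ascii decode somewhere downstream, and all conversions are explicit:

```python
# Python 3: str is Unicode text, bytes is a separate type.
text = "aeio\u00fc"
data = text.encode("utf-8")        # explicit encode: text -> bytes
assert data == b"aeio\xc3\xbc"

try:
    data.replace("a", "b")         # mixing bytes and str: TypeError, not a silent decode
except TypeError as exc:
    error = exc

roundtrip = data.decode("utf-8")   # explicit decode: bytes -> text
```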

Mark Tolonen

Change this line:

with open(fname, 'r') as fd:

to this:

with open(fname, 'r', encoding='latin1') as fd:

The ascii codec can handle character codes between 0 and 127 inclusive. Your file contains the byte 0xc3, which is outside that range, so you need to choose a different codec.
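A quick demonstration of why 0xc3 trips the ascii codec while other codecs accept it (note that latin1 never fails, since every byte value 0-255 maps to some character, but if the file is actually UTF-8 it will produce mojibake rather than the intended 'ü'):

```python
# 0xc3 0xbc is the UTF-8 encoding of 'ü'; 0xc3 is > 127, outside the ASCII range.
raw = b"\xc3\xbc"

try:
    raw.decode("ascii")            # fails: byte values above 127
except UnicodeDecodeError as exc:
    err = exc

assert raw.decode("utf-8") == u"\u00fc"     # correct, if the file really is UTF-8
assert raw.decode("latin1") == u"\xc3\xbc"  # decodes without error, but as 'Ã¼' (mojibake)
```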

Paul Cornelius
  • Thanks! Is this for Python 3? I'm getting `'encoding' is an invalid keyword argument for this function` when trying to use it. Besides that, how come the UnicodeDecodeError is not thrown when using the code snippets exclusively? edit: To use the encoding option in Python 2.7, use io.open - however, I'm still getting `UnicodeEncodeError: 'ascii' codec can't encode characters in position 2721-2722: ordinal not in range(128)` – double_negative Apr 20 '18 at 09:52
  • Even when using `contents = fd.read().decode("latin1").encode("utf8").replace(search, replace)`, the same error is thrown. This doesn't seem like the solution to why this is happening anyway, as the code snippets work standalone. – double_negative Apr 20 '18 at 10:02

Whenever I've had a problem with special characters in the past, I've resolved it by decoding to Unicode when reading and then encoding to UTF-8 when writing back to a file.

I hope this works for you too.

For my solution I've always used what I found in this presentation: http://farmdev.com/talks/unicode/

So I would use this:

def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

Then on your code:

contents = to_unicode_or_bust(fd.read().replace(search, replace))

And then when writing it set encoding back to utf-8.

output_zip.writestr(entry, contents.encode('utf-8'))

I didn't reproduce your issue, so this is just a suggestion. I hope it works.

ThemThem
    Thanks for your suggestion, however I'm still getting `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2722: ordinal not in range(128)`. Unfortunately this workaround tries to tackle an issue that should not exist in the first place. – double_negative Apr 20 '18 at 10:08
  • Could you maybe try as well: `contents = to_unicode_or_bust(fd.read()).replace(search, replace)` – ThemThem Apr 20 '18 at 10:12
  • Thanks for the effort. `UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2722: ordinal not in range(128)` - I tried every encode/decode procedure out there on Stack Overflow. I don't really think that's the fix, to be honest, as it is not clear to me why the error is thrown in the first place, given that the code snippets work. – double_negative Apr 20 '18 at 10:19
  • It can't be anything other than an encoding/decoding issue, since removing it makes the whole thing work, right? I don't know if you are on a Windows system; you might be getting [this issue](https://stackoverflow.com/questions/5760936/handle-wrongly-encoded-character-in-python-unicode-string). If I were you, I would decode on all reads and encode on all writes. – ThemThem Apr 20 '18 at 10:25
  • Removing the 'ü' makes the whole thing work in the larger project, yes - but keeping the 'ü' and running the same code exclusively/in a smaller project works as well and does not throw an encoding/decoding error. I'm on UNIX, but the post you linked was an interesting read nevertheless! – double_negative Apr 20 '18 at 10:29