Replace unicode characters only once in Python

Question

I'm trying to create a small script that replaces a set of characters in a file like this:

# coding=utf-8

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }

    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")

        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)

        oldSubtitleFile.close()

        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

and it works ok for the first run.

So if I have a file containing Eºti sigur cã vrei sã ºtergi fiºierele?, after running the script on that file I get Ești sigur că vrei să ștergi fișierele? which is what I want. But if I run it multiple times I get:

EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

EĂÂti sigur cĂÂ vrei sĂÂ ĂÂtergi fiĂÂierele?

EÄÂĂÂti sigur cÄÂĂÂ vrei sÄÂĂÂ ÄÂĂÂtergi fiÄÂĂÂierele?

EĂÂĂÂÄÂĂÂti sigur cĂÂĂÂÄÂĂÂ vrei sĂÂĂÂÄÂĂÂ ĂÂĂÂÄÂĂÂtergi fiĂÂĂÂÄÂĂÂierele?

And I don't understand why. How does it find some characters that don't exist anymore in the file (ã, º, etc.) to be able to replace them? And why is it even replacing them with some other characters?

Alastair McCormack · Accepted Answer · 2015-03-29T21:05:44.747

3

Simple: on the first run you're reading ISO-8859-1 and writing UTF-8. On the second run you're doing exactly the same, even though the input is now UTF-8, not ISO-8859-1. Each further run misreads the UTF-8 bytes as ISO-8859-1 characters and encodes them again, so the search and replace no longer works as intended.

This test mimics your second iteration, interpreting UTF-8 as ISO-8859-1:

# -*- coding: utf-8 -*-
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1")
>> EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

The next iteration looks like:

print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1").encode("utf-8").decode("ISO-8859-1")
>> EÃÂti sigur cÃÂ vrei sÃÂ ÃÂtergi fiÃÂierele?
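
The two tests above use Python 2 byte strings. As a rough Python 3 sketch of the same effect (the helper name `misread_rounds` is made up for illustration), each round encodes the text as UTF-8 and then misreads those bytes as ISO-8859-1:

```python
# -*- coding: utf-8 -*-

def misread_rounds(text, rounds):
    """Encode text as UTF-8 and misread the bytes as ISO-8859-1, `rounds` times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("iso-8859-1")
    return text

original = u"\u0219"  # "ș", stored as the two bytes C8 99 in UTF-8
once = misread_rounds(original, 1)
twice = misread_rounds(original, 2)

# The character count doubles on every round for this input.
print(len(original), len(once), len(twice))
```

This is why the output in the question keeps growing: every non-ASCII character becomes two or more characters on each run.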

Heed @Daniel's advice to decode once, replace Unicode with Unicode, then encode once. I've also been informed that it's best to use io.open() rather than codecs, as it's Python 3 compatible and solves a problem with universal newlines.
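
A minimal sketch of that advice, written for Python 3 and using io.open(). The function name `fix_subtitles`, its parameters, and the choice to write to a separate output file are assumptions for illustration, not part of the original script:

```python
# -*- coding: utf-8 -*-
import io

# Same pairs as in the question, keyed by code point for str.translate().
REPLACEMENTS = {
    ord(u"ã"): u"ă",
    ord(u"Ã"): u"Ă",
    ord(u"º"): u"ș",
    ord(u"ª"): u"Ș",
    ord(u"þ"): u"ț",
    ord(u"Þ"): u"Ț",
}

def fix_subtitles(src_path, dst_path, src_encoding="iso-8859-1"):
    """Decode once, replace text with text, encode once."""
    # newline="" leaves the file's own line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        content = src.read()

    content = content.translate(REPLACEMENTS)

    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(content)
```

Note that this still assumes the input really is ISO-8859-1. Running it on a file that is already UTF-8 would misread it just like the original script.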

edited Mar 29 '15 at 21:05

answered Mar 29 '15 at 20:22

Alastair McCormack


And is it possible to check a file's encoding? So if it's not `ISO-8859-1`, it skips the replacing. – Iulian Onofrei Mar 29 '15 at 21:16
Accurately detecting character encoding is very difficult. If you know what the text is supposed to say, you could compare the input to a string in the correct encoding. Alternatively, there are some good Python libs which try to guess the encoding and may be sufficient for what you need: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – Alastair McCormack Mar 29 '15 at 21:25
@Iulian, write out the "byte object mark" at the beginning of your output file. It's not required or recommended for utf-encoded files, but it is invisible to text editors and it's a standard way to identify unicode. Better yet: Change your workflow so you know how your files are encoded without having to check. – alexis Mar 29 '15 at 21:26
@AlastairMcCormack, I think that the easiest solution would be to leave the original file intact, so if I mistakenly run my script on `subtitle.srt` it would produce `subtitle_fixed.srt` every time, replacing _(with no gain, nor harm)_ the old one. – Iulian Onofrei Mar 29 '15 at 21:26
Google "Unicode BOM". Windows editors add it to the start of unicode files, so you're safe adding it too. – alexis Mar 29 '15 at 21:28
@alexis It's a novel idea (I upvoted you) but it's not entirely safe, especially when dealing with .srt for closed platforms - not all devices understand Byte **Order** Marks. – Alastair McCormack Mar 29 '15 at 21:30
@Alastair I agree (I even said it's not recommended), but since he's on Windows it's a pretty safe bet. Still, it's far better to adopt a saner filename scheme... – alexis Mar 29 '15 at 21:40
@alexis, So something like my comment about `subtitle_fixed.srt`? – Iulian Onofrei Mar 29 '15 at 21:42
I've just looked at the .srt Wikipedia page. It looks like the safe thing to do (when dealing with Western languages) is to encode to `windows-1252` rather than utf-8: http://en.wikipedia.org/wiki/SubRip#Text_encoding – Alastair McCormack Mar 29 '15 at 21:45
Just noticed I somehow typed "object" instead of "order" mark. Embarrassing... Thanks for catching it, @Alastair! – alexis Mar 29 '15 at 21:59
I won't bother to guess what the wikipedia entry is trying to say, but CP-1252 is an 8-bit encoding. Only the ASCII part (i.e., the low page) is "compatible" with UTF-8. Don't do it. – alexis Mar 29 '15 at 22:03
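
On the question raised in these comments of skipping files that are not ISO-8859-1: one common heuristic, sketched below for Python 3, is to try a strict UTF-8 decode of the raw bytes and treat success (or a leading UTF-8 BOM) as "already converted". The function name `looks_like_utf8` is hypothetical, and this is a guess rather than real detection, since some ISO-8859-1 byte sequences happen to be valid UTF-8 as well:

```python
# -*- coding: utf-8 -*-
import codecs

def looks_like_utf8(raw):
    """Return True if the bytes start with a UTF-8 BOM or decode cleanly as UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        return True
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII input also passes this check, which is harmless here because such a file contains nothing to replace.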

score 0 · Answer 2 · answered Mar 29 '15 at 19:44

Don't work with encoded content. Only encode when writing the new file:

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": u"ă",
        u"Ã": u"Ă",
        u"º": u"ș",
        u"ª": u"Ș",
        u"þ": u"ț",
        u"Þ": u"Ț",
    }

    if os.path.isfile(subtitleFileName):
        with codecs.open(subtitleFileName, "rb", "ISO-8859-1") as oldSubtitleFile:
            subtitleContent = oldSubtitleFile.read()

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(key, value)

        with codecs.open(newSubtitleFileName, "wb", "utf-8") as newSubtitleFile:
            newSubtitleFile.write(subtitleContent)

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

Still keeps replacing like before. – Iulian Onofrei Mar 29 '15 at 19:47

score 0 · Answer 3 · answered Mar 29 '15 at 20:31

It is incorrect to decode "utf-8" content as "ISO-8859-1". The very first time you run your script, it takes a text file (presumably "ISO-8859-1" encoded) and saves it as "utf-8" while replacing certain Unicode characters.

When you run the conversion a second time, it takes the "utf-8" content and tries to interpret it as "ISO-8859-1", which is wrong.

To avoid the confusion, make the replacements separately from the change of character encoding. That way the replacements are idempotent.
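
As a small Python 3 check of the idempotence claim (the helper name `replace_chars` is made up for illustration): once the text is decoded, applying the replacements a second time changes nothing, because none of the replacement characters is itself a key in the mapping.

```python
# -*- coding: utf-8 -*-

# Same pairs as in the question, keyed by code point for str.translate().
REPLACEMENTS = {
    ord(u"ã"): u"ă",
    ord(u"Ã"): u"Ă",
    ord(u"º"): u"ș",
    ord(u"ª"): u"Ș",
    ord(u"þ"): u"ț",
    ord(u"Þ"): u"Ț",
}

def replace_chars(text):
    """Replace the wrongly displayed characters in already-decoded text."""
    return text.translate(REPLACEMENTS)

broken = u"Eºti sigur cã vrei sã ºtergi fiºierele?"
fixed_once = replace_chars(broken)
fixed_twice = replace_chars(fixed_once)

# A second pass finds nothing left to replace.
print(fixed_once == fixed_twice)
```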

To make the replacements, you could use the fileinput module and unicode.translate():

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Replace some characters in 'iso-8859-1'-encoded files."""
import fileinput # read files given on the command-line and/or stdin

replacements = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}
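# translate() looks characters up by code point, so key => ord(key)
replacements = {ord(key): value for key, value in replacements.items()}
for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
    # each line keeps its own newline; strip it so print() does not add a second one
    print(line.translate(replacements).rstrip(u"\n"))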

To control the encoding of the output file, you could set PYTHONIOENCODING e.g., in bash:

$ PYTHONIOENCODING=utf-8 python replace-chars.py iso-8859-1.txt >replaced.utf-8

This command both replaces the characters and transcodes the input from "iso-8859-1" to "utf-8".

If the input filename.txt is already broken (no single character encoding decodes it correctly), you could try the ftfy module to fix common encoding errors:

$ ftfy filename.txt >filename.utf8.txt

Using this code snippet I get `UnicodeEncodeError: 'charmap' codec can't encode character u'\u0219' in position 4: character maps to <undefined>` in console _(running on Windows)_. – Iulian Onofrei Mar 29 '15 at 21:19
@IulianOnofrei set the PYTHONIOENCODING environment variable (using cmd.exe-specific syntax) and *redirect* the output to a file as shown in the answer (using the `>` operator). – jfs Mar 29 '15 at 21:47
