I'm trying to create a small script that replaces a set of characters in a file like this:
# coding=utf-8
import codecs
import os
import sys

args = sys.argv
if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"
    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }
    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")
        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")
        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)
        oldSubtitleFile.close()
        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()
        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)
        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"
It works fine on the first run. If I have a file containing
Eºti sigur cã vrei sã ºtergi fiºierele?
then after running the script on that file I get
Ești sigur că vrei să ștergi fișierele?
which is what I want. But if I run the script on the same file again, each additional run gives:
EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?
EĂÂti sigur cĂÂ vrei sĂÂ ĂÂtergi fiĂÂierele?
EÄÂĂÂti sigur cÄÂĂÂ vrei sÄÂĂÂ ÄÂĂÂtergi fiÄÂĂÂierele?
EĂÂĂÂÄÂĂÂti sigur cĂÂĂÂÄÂĂÂ vrei sĂÂĂÂÄÂĂÂ ĂÂĂÂÄÂĂÂtergi fiĂÂĂÂÄÂĂÂierele?
And I don't understand why. How does it find some characters that don't exist anymore in the file (ã, º, etc.) to be able to replace them? And why is it even replacing them with some other characters?
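
To narrow this down, here is a minimal sketch that seems to reproduce the effect in memory, without touching any files. It assumes that after the first run the file on disk is UTF-8, while every later run still decodes it as ISO-8859-1. `one_pass` and `REPLACE_PAIRS` are hypothetical names introduced only for this illustration (they are not in my script), and the sketch replaces on decoded text rather than on encoded bytes, which should be equivalent for these characters.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Same pairs as replacePairs in the script, written as escapes so the
# example does not depend on the source file's encoding.
REPLACE_PAIRS = {
    u"\u00e3": u"\u0103",  # ã -> ă
    u"\u00c3": u"\u0102",  # Ã -> Ă
    u"\u00ba": u"\u0219",  # º -> ș
    u"\u00aa": u"\u0218",  # ª -> Ș
    u"\u00fe": u"\u021b",  # þ -> ț
    u"\u00de": u"\u021a",  # Þ -> Ț
}


def one_pass(file_bytes):
    """Hypothetical stand-in for one run of the script.

    Decodes the bytes as ISO-8859-1 (whatever they really are),
    applies the replacements, and returns the result as UTF-8 bytes.
    """
    text = file_bytes.decode("iso-8859-1")
    for old, new in REPLACE_PAIRS.items():
        text = text.replace(old, new)
    return text.encode("utf-8")


# "Eºti" as stored in the original ISO-8859-1 file.
original = u"E\u00bati".encode("iso-8859-1")

first = one_pass(original)   # input really is ISO-8859-1 here
second = one_pass(first)     # input is now UTF-8, but decoded as ISO-8859-1
third = one_pass(second)     # same mismatch again

for label, data in (("first", first), ("second", second), ("third", third)):
    print(label, len(data), repr(data.decode("utf-8")))
```

If this assumption is right, the second pass splits each two-byte UTF-8 character into two ISO-8859-1 characters, and from the third pass on one of those characters is `Ã`, which is itself a key in the replacement table. That would explain both the growing output and why replacements still happen even though `ã`, `º`, etc. are no longer in the file.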