I need to replace unicode characters according to a custom set of substitutions. The custom substitutions are defined by someone else's API and I basically just have to deal with it. As it stands, I have extracted all the required substitutions into a csv file. Here's a sample:
\u0020,
\u0021,!
\u0023,#
\u0024,$
\u0025,%
\u0026,&
\u0028,(
\u0029,)
\u002a,*
\u002b,+
\u002c,","
\u002d,-
\u002e,.
\u002f,/
\u03ba,kappa
...
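For what it's worth, the rows seem to parse the way you'd expect. Here's a quick check in the Python 2 shell (Python 2 since that's what my code is on), feeding csv.reader three of the lines above just to show the quoted comma row comes through as a single comma:

>>> import csv
>>> from StringIO import StringIO
>>> sample = '\\u0021,!\r\n\\u002c,","\r\n\\u03ba,kappa\r\n'
>>> for row in csv.reader(StringIO(sample)):
...     print row
...
['\\u0021', '!']
['\\u002c', ',']
['\\u03ba', 'kappa']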
I generated this in MS Excel by hacking up the Java program the API owners use themselves when they need to do conversions (and no...they won't just run the converter when the API receives input...). There are ~1500 substitutions defined.
When I generate output (from my Django application) to send to their API as input, I want to handle the substitutions. Here is how I have been trying to do it:
import csv
import os

# CONVERSION_FILE is defined elsewhere in my module; it points at the csv shown above

class UTF8Converter(object):
    def __init__(self):
        # build the replacement mapping from the csv file
        full_file_path = os.path.join(os.path.dirname(__file__),
                                      CONVERSION_FILE)
        with open(full_file_path) as csvfile:
            reader = csv.reader(csvfile)
            mapping = []
            for row in reader:
                # remove escape-y slash
                mapping.append((row[0], row[1]))  # here's the problem
            self.mapping = mapping

    def replace_UTF8(self, string):
        for old, new in self.mapping:
            print new
            string = string.replace(old, new)  # replace() returns a new string, it doesn't modify in place
        return string
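For context, this is roughly how I end up calling it (the input string here is made up, but the kappa and & substitutions are taken from the sample above):

# hypothetical call site, while building the text that gets sent to their API
converter = UTF8Converter()
payload = converter.replace_UTF8(u'\u03ba \u0026 co.')
# hoping to end up with something like u'kappa & co.'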
The problem is that the unicode codes from the csv file are appearing in the mapping as, for example, self.mapping[example][0] = '\\u00e0'. Ok, well that's wrong, so let's try:
mapping.append( (row[0].decode("string_escape"), row[1]) )
No change. How about:
mapping.append( (row[0].decode("unicode_escape"), row[1]) )
Ok, now self.mapping[example][0] = u'\xe0'. So yeah, that's the character that I need to replace...but the string that I need to call the replace_UTF8() function on looks like u'\u00e0'.
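For reference, this is what the two codecs actually do with one of those cells in the Python 2 shell (cell contents typed out by hand here):

>>> cell = '\\u00e0'               # six characters: backslash, u, 0, 0, e, 0
>>> cell.decode('string_escape')   # \u is not a byte-string escape, so nothing changes
'\\u00e0'
>>> cell.decode('unicode_escape')  # interprets \uXXXX and gives back a unicode object
u'\xe0'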
I have also tried row[0].decode("utf-8"), row[0].encode("utf-8"), and unicode(row[0], "utf-8").
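As far as I can tell, those all behave the same way because the cell is plain ASCII, so decoding or encoding as UTF-8 only changes the type, not the six characters:

>>> cell = '\\u00e0'
>>> cell.decode('utf-8')      # ASCII-only bytes, so this is effectively just a type change
u'\\u00e0'
>>> unicode(cell, 'utf-8')    # same result
u'\\u00e0'
>>> cell.encode('utf-8')      # implicit ascii decode then utf-8 encode; still the same characters
'\\u00e0'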
I also tried this approach, but it doesn't apply here: I don't have literal unicode characters in the csv file, I have escaped code points like \u00e0 (not sure if that is the correct terminology or what).
So, how do I turn the string that I read in from the csv file into a unicode string that I can use with mythingthatneedsconverted.replace(...)?
Or...do I need to do something else with the csv file to use a more sensible approach?