I need to replace unicode characters according to a custom set of substitutions. The custom substitutions are defined by someone else's API and I basically just have to deal with it. As it stands, I have extracted all the required substitutions into a csv file. Here's a sample:
\u0020,
\u0021,!
\u0023,#
\u0024,$
\u0025,%
\u0026,&
\u0028,(
\u0029,)
\u002a,*
\u002b,+
\u002c,","
\u002d,-
\u002e,.
\u002f,/
\u03ba,kappa
...
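For what it's worth, the rows seem to parse the way you'd expect. Here's a quick check in the Python 2 shell (Python 2 since that's what my code is on), feeding csv.reader three of the lines above just to show the quoted comma row comes through as a single comma:

>>> import csv
>>> from StringIO import StringIO
>>> sample = '\\u0021,!\r\n\\u002c,","\r\n\\u03ba,kappa\r\n'
>>> for row in csv.reader(StringIO(sample)):
...     print row
...
['\\u0021', '!']
['\\u002c', ',']
['\\u03ba', 'kappa']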
I generated this in MS Excel by hacking up the Java program the API owners use themselves when they need to do conversions (and no...they won't just run the converter when the API receives input...). There are ~1500 substitutions defined.
When I generate output (from my Django application) to send to their API as input, I want to handle the substitutions. Here is how I have been trying to do it:
import csv
import os

# CONVERSION_FILE is defined elsewhere in my module; it points at the csv shown above

class UTF8Converter(object):
    def __init__(self):
        # build the replacement mapping from the csv file
        full_file_path = os.path.join(os.path.dirname(__file__),
                                      CONVERSION_FILE)
        with open(full_file_path) as csvfile:
            reader = csv.reader(csvfile)
            mapping = []
            for row in reader:
                # remove escape-y slash
                mapping.append((row[0], row[1]))  # here's the problem
            self.mapping = mapping

    def replace_UTF8(self, string):
        for old, new in self.mapping:
            print new
            string = string.replace(old, new)  # replace() returns a new string, it doesn't modify in place
        return string
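For context, this is roughly how I end up calling it (the input string here is made up, but the kappa and & substitutions are taken from the sample above):

# hypothetical call site, while building the text that gets sent to their API
converter = UTF8Converter()
payload = converter.replace_UTF8(u'\u03ba \u0026 co.')
# hoping to end up with something like u'kappa & co.'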
The problem is that the unicode codes from the csv file are appearing in the mapping as, for example, self.mapping[example][0] = '\\u00e0'. Ok, well that's wrong, so let's try:
mapping.append( (row[0].decode("string_escape"), row[1]) )
No change. How about:
mapping.append( (row[0].decode("unicode_escape"), row[1]) )
Ok, now self.mapping[example][0] = u'\xe0'. So yeah, that's the character that I need to replace...but the string that I need to call the replace_UTF8() function on looks like u'\u00e0'.
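For reference, this is what the two codecs actually do with one of those cells in the Python 2 shell (cell contents typed out by hand here):

>>> cell = '\\u00e0'               # six characters: backslash, u, 0, 0, e, 0
>>> cell.decode('string_escape')   # \u is not a byte-string escape, so nothing changes
'\\u00e0'
>>> cell.decode('unicode_escape')  # interprets \uXXXX and gives back a unicode object
u'\xe0'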
I have also tried row[0].decode("utf-8"), row[0].encode("utf-8"), and unicode(row[0], "utf-8").
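As far as I can tell, those all behave the same way because the cell is plain ASCII, so decoding or encoding as UTF-8 only changes the type, not the six characters:

>>> cell = '\\u00e0'
>>> cell.decode('utf-8')      # ASCII-only bytes, so this is effectively just a type change
u'\\u00e0'
>>> unicode(cell, 'utf-8')    # same result
u'\\u00e0'
>>> cell.encode('utf-8')      # implicit ascii decode then utf-8 encode; still the same characters
'\\u00e0'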
I also tried this approach, but it doesn't apply here: I don't have literal unicode characters in the csv file, I have escaped code points like \u00e0 (not sure if that is the correct terminology or what).
So, how do I turn the string that I read in from the csv file into a unicode string that I can use with mythingthatneedsconverted.replace(...)?
Or...do I need to do something else with the csv file to use a more sensible approach?