2

So I have a database with a lot of names, and some of the names contain bad characters. For example, a name in one record is JosÃ© Florés. I want to clean this to get José Florés.

I tried the following:

name = "    JosÃ©     Florés "
print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))

The output mangles the last name: ' José Flor\\xe9s '

What is the best way to solve this? The names can have any kind of unicode or hex escape sequences.

snakecharmerb
  • It looks like you have two different encodings in one value. I guess the best way would be to split the name into single words, try to decode each with both encodings (handle a possible exception and try the other), and then put the words back together. – Klaus D. Jan 03 '19 at 18:44
  • I thought about that, but how should I go about doing this in a database of names like this, which can have any kind of encoding? – Arpit Acharya Jan 03 '19 at 18:48
  • This looks to me more like a search-and-replace problem than an encoding/decoding problem. You have a valid string for the encoding, just not the string that's intended. Identify the problem character sequences and the good replacements, then use regex to search and replace. But beware [little bobby tables](https://www.xkcd.com/327/): not an exact analogy, but the "that'll never happen" attitude can be dangerous when those problem sequences occasionally turn out to be perfectly valid and correct, so the substitution is an error. It might be better to automate the search, but fix manually. –  Jan 03 '19 at 18:59
  • Is this *really* what your data looks like? It's extremely unusual to see two different encodings in the same string. – Mark Ransom Jan 03 '19 at 19:04
  • Well, this is actually one of the many records I have. – Arpit Acharya Jan 04 '19 at 20:18
  • I just ran across an answer for converting Unicode strings with embedded escape sequences: https://stackoverflow.com/a/24519338/5987. – Mark Ransom Jan 07 '19 at 17:21
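
The word-by-word idea from the first comment could be sketched roughly as follows. This is only an illustration, not anyone's posted solution: `fix_word` and `clean_name` are hypothetical names, and the sketch assumes the only damage is UTF-8 text that was mis-read as ISO-8859-1 (for example `JosÃ©` in place of `José`). Other kinds of breakage, such as Windows-1252 mojibake or literal escape sequences, would need extra handling.

```python
def fix_word(word: str) -> str:
    """Undo UTF-8-read-as-ISO-8859-1 mojibake in one word, if present."""
    try:
        # A mojibake word such as 'JosÃ©' round-trips cleanly back to 'José'.
        return word.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # A word that is already correct, such as 'Florés', fails the
        # round trip, so it is returned unchanged.
        return word


def clean_name(name: str) -> str:
    # str.split() with no argument also collapses the stray whitespace.
    return " ".join(fix_word(word) for word in name.split())


print(clean_name("    JosÃ©     Florés "))
```

Because a word is kept as-is whenever the round trip fails, correctly encoded names pass through untouched. The caveat from the comments still applies: a short word that happens to be valid in both readings could be "fixed" wrongly, so reviewing the changed records by hand may be safer.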

2 Answers

4

ftfy is a Python library that repairs Unicode text broken in various ways. Its main function is fix_text.

from ftfy import fix_text

def convert_iso_name_to_string(name):
    result = []

    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)

name = "JosÃ© Florés"
assert convert_iso_name_to_string(name) == "José Florés"

Applying the fix_text function to each word standardizes the names, which is an alternative way to solve the problem.

-1

We'll start with an example string containing a non-ASCII character (here “é”, or “e-acute”):

s = 'Florés'

Now if we reference and print the string, it gives us essentially the same result:

>>> s
'Florés'
>>> print(s)
Florés

Unlike in Python 2.x, where the same literal would be a byte string, s here is already a Unicode string: every str in Python 3.x is Unicode. The visible consequence is that s is unchanged after we create it, so it looks the same whether we echo it or print it.
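
To connect this to the question, here is a small sketch of how a Unicode string turns into bytes and back, and how decoding with the wrong codec produces the kind of garbling seen in the records. The variable names are illustrative only, and the sketch assumes the text was originally UTF-8 that got decoded as ISO-8859-1.

```python
s = "Flor\u00e9s"  # the same string as 'Florés', written with an escape

# Encoding turns the str into bytes; 'é' takes two bytes in UTF-8.
utf8_bytes = s.encode("utf-8")
print(utf8_bytes)  # b'Flor\xc3\xa9s'

# Decoding those bytes with the wrong codec yields mojibake.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # FlorÃ©s

# Reversing the wrong step restores the original text.
restored = garbled.encode("iso-8859-1").decode("utf-8")
assert restored == s
```

The reverse step only works when the whole string was garbled the same way. On an already-correct `'Florés'` it raises `UnicodeDecodeError`, which is why the attempt in the question ends up with `\xe9` in the last name.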

The same explanation can be found in "Encoding and Decoding Strings".

Manoj Patel