
I have some code that reads a .txt file into a variable as a string.

Then I try to use .replace() on it to change the character "ó" to "o", but it is not working: the console prints the same thing.

Code:

def normalize(filename):

    #Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error.
    #File says: "Es una rubrica de evaluación." (among many emojis)

    txt_raw = open(filename, "r", errors="ignore")
    txt_read = txt_raw.read()


    #Here, only "ó" is replaced. In the real code, a for loop iterates through all the accented characters.

    rem_accent_txt = txt_read.replace("ó", "o")
    print(rem_accent_txt)

    return

Expected output:

"Es una rubrica de evaluacion."

Current Output:

"Es una rubrica de evaluación."

It does not print an error or anything; it just prints the string as it is.

I believe the problem lies in the fact that the string comes from a file: when I create the string directly in the code, the replacement works, but it does not work when the string is read from a file.
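
A small check like this (the file name is just an example) shows exactly which codepoints were read, so you can see whether the "ó" coming from the file is one character or two:

import unicodedata

# Hypothetical file name; prints every character read along with its Unicode
# name, which reveals whether "ó" was stored as one codepoint or as two.
with open("chat.txt", "r", errors="ignore") as f:
    for ch in f.read():
        print(repr(ch), unicodedata.name(ch, "<unnamed>"))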

EDIT: SOLUTION!

Thanks to @juanpa.arrivillaga and @das-g I came up with this solution:

from unidecode import unidecode

def get_txt(filename):

    # Read the file with an explicit encoding instead of ignoring errors.
    with open(filename, "r", encoding="utf8") as txt_raw:
        txt_read = txt_raw.read()

    # unidecode transliterates accented characters to their ASCII equivalents.
    txt_decode = unidecode(txt_read)

    print(txt_decode)

    return txt_decode
  • What version of python do you use? – Ivan Velichko Sep 03 '20 at 19:58
  • Looks like a unicode issue - https://stackoverflow.com/questions/13093727/how-to-replace-unicode-characters-in-string-with-something-else-python – zvi Sep 03 '20 at 19:59
  • I use 3.8.3 64-bit. I have tried unicode, but it messes up the characters. For example, it would output something like `evaluaciAn`. – Leonardo Echeverria Sep 03 '20 at 20:00
  • "#Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error." You should instead try to use the correct encoding. – juanpa.arrivillaga Sep 03 '20 at 20:02
  • And potentially the problem is when you *create* the file. How is the text file created? – juanpa.arrivillaga Sep 03 '20 at 20:04
  • @zvi that's not really relevant, the OP is using Python 3 (I hope). – juanpa.arrivillaga Sep 03 '20 at 20:05
  • There's different ways to map something that looks like "ó" to a sequence of Unicode codepoints. You'll probably want to [normalize the string](https://stackoverflow.com/q/16467479/674064) before attempting character replacements. But first, make sure you get the correct Unicode string by reading the file with the appropriate encoding. – das-g Sep 03 '20 at 20:06
  • Btw., if you want to simply get rid of non-Ascii letters, you might not have to do any manual replacement at all, and just choose the right Unicode normalization for that purpose. – das-g Sep 03 '20 at 20:08
  • @LeonardoEcheverria wait wait wait, you are using *Python 2*? You **really** shouldn't be. It is past its official end of life and no longer maintained. Furthermore, Python 3 fixes a lot of issues; basically, `str` becomes what `unicode` was in Python 2. – juanpa.arrivillaga Sep 03 '20 at 22:57
  • @juanpa.arrivillaga, not at all! Python 3.8.3 64-Bit. Does something in the code suggest it is Python 2.0? – Leonardo Echeverria Sep 04 '20 at 01:20
  • @LeonardoEcheverria I misread `unidecode(txt_read)` as `unicode(txt_read)`, my mistake... but what is `unidecode`? did you mean `unicodedata.normalize`? – juanpa.arrivillaga Sep 04 '20 at 01:31
  • @juanpa.arrivillaga so sorry, I forgot to add my import. `from unidecode import unidecode`. I added it in! – Leonardo Echeverria Sep 04 '20 at 02:04

1 Answer


Almost certainly, what is occurring is that you have unnormalized Unicode strings. Essentially, there are two ways to create "ó" in Unicode:

>>> combining = 'ó'
>>> composed = 'ó'
>>> len(combining), len(composed)
(2, 1)
>>> list(combining)
['o', '́']
>>> list(composed)
['ó']
>>> import unicodedata
>>> list(map(unicodedata.name, combining))
['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
>>> list(map(unicodedata.name, composed))
['LATIN SMALL LETTER O WITH ACUTE']

Just normalize your strings:

>>> composed == combining
False
>>> composed == unicodedata.normalize("NFC", combining)
True
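
Applied to the question's code, a minimal sketch of the fix would be (with `txt_read` being the string read from the file):

import unicodedata

# Normalize to the composed form (NFC) first, so the literal "ó" in the
# replacement actually matches the text that was read from the file.
rem_accent_txt = unicodedata.normalize("NFC", txt_read).replace("ó", "o")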

Although, taking a step back: do you really want to remove accents, or do you just want to normalize to the composed form, like the above?
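
If you do want to drop the accents entirely using only the standard library, one possible sketch is to decompose the string and filter out the combining marks (this strips every accent, not just the one on "ó"):

import unicodedata

def strip_accents(text):
    # Decompose accented characters into base letter + combining mark (NFD),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# strip_accents("Es una rubrica de evaluación.") -> "Es una rubrica de evaluacion."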

As an aside, you shouldn't ignore the errors when reading your text file; you should use the correct encoding. I suspect the text file is being written with an incorrect encoding, because emojis can be handled just fine: they aren't anything special in Unicode.

>>> emoji = "😀"
>>> print(emoji)
😀
>>> unicodedata.name(emoji)
'GRINNING FACE'
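
For instance, a round trip with an explicit encoding (the file name here is just an example) keeps both the accents and the emoji intact, with no need for errors="ignore":

# Hypothetical example: write and read the file with an explicit UTF-8
# encoding; accented characters and emojis survive the round trip.
with open("demo.txt", "w", encoding="utf8") as f:
    f.write("Es una rubrica de evaluación. \U0001F600")

with open("demo.txt", "r", encoding="utf8") as f:
    print(f.read())  # Es una rubrica de evaluación. 😀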
juanpa.arrivillaga