
I'm using a web app to retrieve results data from a game I play. As I'm Brazilian and my language has accented Latin characters, most of the data I retrieve arrives in a form I can't use. For example:

Carlos López = Carlos Lã³Pez

I searched the internet and found ftfy recommended as a good fixer for broken text. I'm not very familiar with Unicode, ASCII and so on, so I simply ran ftfy on the text, and the output had the same errors as the input.

In[15]: ftfy.fix_text('Carlos Lã³Pez')
Out[15]: 'Carlos Lã³Pez'

In[16]: ftfy.fix_encoding('Carlos Lã³Pez')
Out[16]: 'Carlos Lã³Pez'

In[17]: ftfy.fix_text('Carlos Lã³Pez')
Out[17]: 'Carlos Lã³Pez'

print(ftfy.fix_text('Carlos Lã³Pez'))
Carlos Lã³Pez

print(ftfy.fix_encoding('Carlos Lã³Pez'))
Carlos Lã³Pez

ftfy.explain_unicode('Carlos Lã³Pez')
U+0043  C       [Lu] LATIN CAPITAL LETTER C
U+0061  a       [Ll] LATIN SMALL LETTER A
U+0072  r       [Ll] LATIN SMALL LETTER R
U+006C  l       [Ll] LATIN SMALL LETTER L
U+006F  o       [Ll] LATIN SMALL LETTER O
U+0073  s       [Ll] LATIN SMALL LETTER S
U+0020          [Zs] SPACE
U+004C  L       [Lu] LATIN CAPITAL LETTER L
U+00E3  ã       [Ll] LATIN SMALL LETTER A WITH TILDE
U+00B3  ³       [No] SUPERSCRIPT THREE
U+0050  P       [Lu] LATIN CAPITAL LETTER P
U+0065  e       [Ll] LATIN SMALL LETTER E
U+007A  z       [Ll] LATIN SMALL LETTER Z

ftfy.explain_unicode(unidecode('Carlos Lã³Pez'))
U+0043  C       [Lu] LATIN CAPITAL LETTER C
U+0061  a       [Ll] LATIN SMALL LETTER A
U+0072  r       [Ll] LATIN SMALL LETTER R
U+006C  l       [Ll] LATIN SMALL LETTER L
U+006F  o       [Ll] LATIN SMALL LETTER O
U+0073  s       [Ll] LATIN SMALL LETTER S
U+0020          [Zs] SPACE
U+004C  L       [Lu] LATIN CAPITAL LETTER L
U+0061  a       [Ll] LATIN SMALL LETTER A
U+0033  3       [Nd] DIGIT THREE
U+0050  P       [Lu] LATIN CAPITAL LETTER P
U+0065  e       [Ll] LATIN SMALL LETTER E
U+007A  z       [Ll] LATIN SMALL LETTER Z

print(ftfy.fix_encoding(unidecode('Carlos Lã³Pez')))
Carlos La3Pez

print(ftfy.fix_text(unidecode('Carlos Lã³Pez')))
Carlos La3Pez

I'd like to know if there's a package that fixes this kind of error, or if you could give me any lead on why Carlos López turned into Carlos Lã³Pez. I'd appreciate it.

Ramon Barros
  • How did you obtain the string in the first place? Did you correctly `.decode()` the Web data? – DYZ Jan 19 '18 at 02:44
  • Thanks for asking, I didn't know that the way I obtained it was important. Actually, I did it by hand. The web app has an "import table to csv" button that I'm using. The CSV file is generated with the "wrong" characters. What I'm trying to do is "convert" these characters back to the correct form (I don't want to bother the admin of the site with this) – Ramon Barros Jan 19 '18 at 02:49
  • Possible duplicate of https://stackoverflow.com/questions/132318/how-do-i-correct-the-character-encoding-of-a-file – Matteo T. Jan 19 '18 at 02:55
  • @MatteoT. Probably not. – DYZ Jan 19 '18 at 02:57
  • Actually, I feel it's a duplicate. I will try the solutions they gave. I may have more insights about my problem! – Ramon Barros Jan 19 '18 at 03:13

1 Answer


Wow, that was tough :) Your string was in the wrong encoding and wrong character case, too.

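s = 'Carlos Lã³Pez'
# upper() turns 'ã' back into 'Ã'; the cp1252 bytes re-read as UTF-8 give 'ó';
# title() restores the capitalization of the name
s.upper().encode('cp1252').decode('utf-8').title()
#'Carlos López'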

This code works in Python 3, but not in Python 2.
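As for why the name ended up this way, here is a minimal sketch that reproduces the damage. It assumes (this is a guess, not something the site confirms) that the name was stored as UTF-8, read back as cp1252, and then title-cased. The variable names are only for illustration.

```python
# Assumption: UTF-8 text was misread as cp1252 and then title-cased.
original = 'Carlos López'

# Step 1: 'ó' is the UTF-8 bytes 0xC3 0xB3, which cp1252 shows as 'Ã³'.
mojibake = original.encode('utf-8').decode('cp1252')

# Step 2: title-casing lowercases 'Ã' to 'ã' and capitalizes the 'p' after '³',
# because '³' is not a letter and so starts a new "word".
broken = mojibake.title()

# Reversal: upper() restores 'Ã', the cp1252 -> UTF-8 round trip restores 'ó',
# and title() restores the capitalization of the name.
repaired = broken.upper().encode('cp1252').decode('utf-8').title()

print(mojibake, broken, repaired, sep=' | ')
```

After the case change, the cp1252 bytes of 'ã³' (0xE3 0xB3) are no longer valid UTF-8, which is probably why a tool that looks for the usual mojibake patterns returns the string unchanged.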

DYZ
  • `'Carlos Lã³Pez'.replace("ã","Ã").encode('cp1252').decode()` may be better, I get 'Carlos LóPez'. – Matteo T. Jan 19 '18 at 03:25
  • @MatteoT. Your solution works only for ã. There may be other non-ASCII letters in the string. – DYZ Jan 19 '18 at 03:27
  • @DYZ that worked awesomely (also with other names)! Thanks very much. Could you please explain how you found this solution, or what your thinking was? – Ramon Barros Jan 19 '18 at 03:37
  • It was essentially a trial-and-error exercise. I came across the answer almost incidentally. – DYZ Jan 19 '18 at 04:01
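
For a whole column of names, the same one-liner could be wrapped in a small function that leaves a value alone when the round trip fails. This is only a sketch: `repair_name` is a hypothetical helper name, and it assumes every damaged value went through the same cp1252 misreading followed by title-casing.

```python
def repair_name(text):
    """Undo UTF-8-read-as-cp1252 mojibake that was later title-cased.

    Hypothetical helper: returns the input unchanged if the round trip fails.
    """
    try:
        return text.upper().encode('cp1252').decode('utf-8').title()
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # most likely this value was never damaged, so keep it as it is.
        return text


names = ['Carlos Lã³Pez', 'Carlos Lopez', 'João']
print([repair_name(n) for n in names])
```

Note that every value that survives the round trip comes back title-cased, so a name that was deliberately written in another case (for example with a lowercase "da") would be changed too.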