
I have a data frame in which one of the columns holds the currency name in Spanish. For US dollars, the value is

Dólares

But it is HTML-encoded, so what I actually read is an entity (for example '&oacute;' in place of 'ó'), and I can't find a way to decode this for the whole column. This is a problem because I need to export to .csv afterwards, and the encoded text causes trouble.

I tried several encoding/decoding libraries, such as BeautifulSoup and HTMLParser, plus a couple more.

Any idea what could be the problem?

Marco
  • What code did you use to try? What was the output? Can you post some sample data? What is the text encoding of the input (and output)? – Evan Jan 29 '18 at 17:44
  • What operating system are you using? – Cyberguille Jan 29 '18 at 17:45
  • @Cyberguille MacOS High Sierra 10.13.2 – Marco Jan 29 '18 at 17:52
  • @Evan I'm querying from a db using MySQLdb (you can find the code here: https://github.com/pandas-dev/pandas/issues/19447). Most of the time it didn't decode anything at all. The input data had "D&oacute;lares" for the Spanish word "Dólares". – Marco Jan 29 '18 at 17:54
  • I guess I'm a little confused; HTML isn't an encoding, it's a markup language. The text that HTML displays is encoded with a character set, e.g. ASCII or UTF-8. UTF-8 is recommended now, but if you're using Python 2.7, I'm not sure how strong compatibility is. Possibly related: https://stackoverflow.com/questions/20935151/how-to-encode-and-decode-from-spanish-in-python – Evan Jan 29 '18 at 18:06
  • @Evan when I run type() over these fields it returns NoneType. – Marco Jan 29 '18 at 18:08
  • Without any sample data, or runnable code, it's tough to help you out. Also, in your github link, you are exporting to JSON, not CSV, which may introduce additional encoding errors. – Evan Jan 29 '18 at 18:13
  • When you do `db = MySQLdb.connect("","","","" )`, do you specify a `charset='utf8'` for example? I'd mess around with that and see what happens. – Jarad Jan 29 '18 at 18:16

1 Answer


I suspect that what you see is what is actually stored in the database: "D&oacute;lares"

You can convert strings like this as follows:

from html2text import unescape

If you want to drop the accent:

unescape("D&oacute;lares")

Out[29]: 'Dolares'

Or if you want to keep the accent:

unescape("D&oacute;lares", True)

Out[30]: 'Dólares'

To decode a whole column while keeping the accents:

df.Currency = df.Currency.apply(unescape, unicode_snob=True)
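
If html2text is not available, the same decoding can be sketched with Python's standard library instead. This is a minimal sketch, assuming Python 3.4 or later, where `html.unescape` exists; the helper name `decode_entities` and the sample values are hypothetical, not taken from the question's data. It always keeps the accent, and it passes non-string values such as None through unchanged:

```python
from html import unescape  # standard library, Python 3.4+


def decode_entities(value):
    """Decode HTML entities in a string; return non-strings (e.g. None) unchanged."""
    if isinstance(value, str):
        return unescape(value)
    return value


# Hypothetical sample values standing in for the Currency column.
currency = ["D&oacute;lares", "Pesos", None]
decoded = [decode_entities(v) for v in currency]
print(decoded)  # ['Dólares', 'Pesos', None]
```

With pandas, a helper like this could be applied to the column with `df.Currency.map(decode_entities)`.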
adr
  • This did the trick. I ran into one more problem that I'd like to add for future readers: I had to cast to str because some of the values in the column were 'NoneType'. – Marco Jan 30 '18 at 13:51
  • Good to hear. You could do `df.Currency = df.Currency.fillna("")` before the `unescape`. – adr Jan 30 '18 at 18:21