0

I have a simple script that converts values from one type to another using str.replace. It works fine for values using standard English characters but it seems confused by Spanish characters like ñ.

Example:

df['Best time to call?'] = df['Best time to call?']\
    .str.replace('Afternoons / La tarde','afternoon')\
    .str.replace('Evenings / La noche','evening')\
    .str.replace('Do Not Call / No llamar','')\
    .str.replace('Morning / La mañana','morning')

The converted values are output to a CSV through pandas. When I open the CSV in Excel, 'Morning / La mañana' has been converted to 'Morning / La ma�ana' even though all the other ones worked.

EDIT

The older question you all have suggested was specific to python 2 and I am using python 3.

I have found that preemptively replacing the mojibake solves the issue.

df['Best time to call?'] = df['Best time to call?'].str.replace('�', 'n')

followed by:

df['Best time to call?'] = df['Best time to call?'.str.replace('Morning / La manana','morning')

Not the most elegant solution, but it will do for now.

SecretAgentMan
  • 2,856
  • 7
  • 21
  • 41
NWWPA
  • 69
  • 1
  • 7
  • 1
    Have you checked out the response here? I think it should help https://stackoverflow.com/questions/23839465/python-pandas-replace-special-character – Scinana May 27 '21 at 14:41
  • 2
    Your issue might be fixed with the proper UTF-8 encoding – Scinana May 27 '21 at 14:41
  • 1
    Does this answer your question? [Python Pandas Replace Special Character](https://stackoverflow.com/questions/23839465/python-pandas-replace-special-character) – JMA May 27 '21 at 14:43
  • This is not a failure. You have written the file out as UTF-8 but Excel thinks the encoding is Windows-1252. This has a name: *mojibake*, a Japanese name for a problem that first appeared with East Asian scripts but has now become a worldwide plague. What to do depends on what you want. If you want it to look right in Excel, open the file in your program with `encoding="cp1252"`. If you are just using Excel to inspect the file, use something else, such as Notepad++ (free), to view it. – BoarGules May 27 '21 at 14:48
  • I always make a point to check for older answers before posting a question but somehow missed this one. Thanks. I'll check it out. – NWWPA May 27 '21 at 14:48

1 Answers1

1

This is an Excel problem, not a Pandas problem.

Excel is interpreting your UTF-8 encoded CSV file as ISO-8859-1, leading to the mojibake you're seeing.

Either

  • use Excel's Open dialog -> Open Text File -> set the encoding to UTF-8
  • save your file as e.g. XLS, which is always opened with a correct encoding by Excel
  • save the CSV file as ISO-8859-1 instead of UTF-8 (not recommended; you can't represent all characters in ISO-8859-1, and many other software expects UTF-8)
AKX
  • 152,115
  • 15
  • 115
  • 172
  • Thanks for you quick response, but I don't it's Excel. The code is supposed to convert 'Morning / La mañana' to 'morning,' so the character shouldn't even be there for Excel to misinterpret. – NWWPA May 27 '21 at 14:46