
I have a file that includes some sentences, but some of them contain weird characters (√•, √§, √Ñ), as shown below. What are they, and is there a way to convert them back to normal characters in Python?

Thanks,

Examples.

Is there an outdoor grill/bbq place? P√§r

Hej Hur l√•ngt aa√§r de till Stallarna? MVH LAILA

√Ñr d√§r sandstrand och hur l√•ngt

Yonbantai
  • If you know which character should be in place of `√•`, use `text = text.replace("√•", expected_char)`. But this text may use a different encoding from the one you used to decode it, e.g. `Latin1`, `Latin2`, `cp1250`, `iso-8859-2`, etc. Decoding it with a different encoding may give you the correct characters. – furas Nov 14 '19 at 19:20
  • Or maybe your system uses a different encoding than UTF-8. As far as I know, macOS uses a slightly different encoding, which can cause problems. BTW: I found this on Stack Overflow: [How to decode these characters? √° √© √≠](https://stackoverflow.com/questions/15283189/how-to-decode-these-characters-%E2%88%9A-%E2%88%9A-%E2%88%9A%E2%89%A0) – furas Nov 14 '19 at 19:24

1 Answer


It looks like the text was decoded with the wrong encoding: MacRoman instead of UTF-8. It probably came from a macOS system.

If I encode it (to bytes) using MacRoman and then decode it back to a string using UTF-8, I get the correct text:

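# The garbled text: UTF-8 bytes that were mistakenly decoded as MacRoman
text = '''Is there an outdoor grill/bbq place? P√§r

Hej Hur l√•ngt aa√§r de till Stallarna? MVH LAILA

√Ñr d√§r sandstrand och hur l√•ngt'''

# Encode back to the original bytes with MacRoman, then decode them as UTF-8
text = text.encode('MacRoman').decode('utf-8')
print(text)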

Result:

Is there an outdoor grill/bbq place? Pär

Hej Hur långt aaär de till Stallarna? MVH LAILA

Är där sandstrand och hur långt

Tested on Linux Mint 19.2, Python 3.7
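To illustrate why the round trip works, here is a small sketch that reproduces the garbling in the forward direction. It assumes the file was written as UTF-8 and later read as MacRoman, which matches the symptoms but is not confirmed by the question; the variable names are only for illustration.

```python
# Reproduce the garbling: UTF-8 bytes misread as MacRoman.
original = "Pär"
utf8_bytes = original.encode("utf-8")      # 'ä' becomes the two bytes C3 A4
garbled = utf8_bytes.decode("mac_roman")   # each byte is shown as one MacRoman character

print(utf8_bytes)
print(garbled)

# Reversing the two steps recovers the original bytes and then the original text.
restored = garbled.encode("mac_roman").decode("utf-8")
print(restored)
```

In MacRoman, byte C3 is `√` and byte A4 is `§`, which is why a single `ä` shows up as the pair `√§`.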

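Information about MacRoman comes from the question [How to decode these characters? √° √© √≠](https://stackoverflow.com/questions/15283189/how-to-decode-these-characters-%E2%88%9A-%E2%88%9A-%E2%88%9A%E2%89%A0)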
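If the file mixes garbled and already-correct lines, applying the round trip to everything may raise an exception on the correct ones. A possible sketch of a more forgiving helper follows; `fix_mojibake` and its parameters are hypothetical names, not part of any library, and the fallback assumes that a failed round trip means the line was not garbled this way.

```python
def fix_mojibake(text, wrong="mac_roman", right="utf-8"):
    """Undo text that was stored as `right` but decoded as `wrong`.

    Returns the input unchanged if the round trip fails, which usually
    means the string was not garbled in this particular way.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


lines = [
    "Is there an outdoor grill/bbq place? P√§r",  # garbled
    "Är där sandstrand och hur långt",            # already correct
]
for line in lines:
    print(fix_mojibake(line))
```

This is a heuristic: a string that legitimately contains a sequence such as `√§` would also be converted, so it is worth spot-checking the output.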

furas