I am trying to understand the difference between UTF-8, ASCII, and Unicode. I've already read "Unicode, UTF, ASCII, ANSI format differences", but I am getting some error from Python and I don't know how to tell which kind of format my string has.

For example:

1# 'Klaus-Groth-Stra&szlig;e, Ballahausen'
2# 'Capit\xe1n\n'
3# u'Capit\xe1n\n'

I surmise that

  • 3# = Unicode because of the u'?
  • 1#=?
  • 2#=?

I already tried to write string #1 to a file and wrote myself a small function:

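def escape(html):
    # Replace the HTML entities for German umlauts and sharp s with the real letters
    html = html.replace('&ouml;', 'ö')
    html = html.replace('&Ouml;', 'Ö')
    html = html.replace('&auml;', 'ä')
    html = html.replace('&Auml;', 'Ä')
    html = html.replace('&uuml;', 'ü')
    html = html.replace('&Uuml;', 'Ü')
    html = html.replace('&szlig;', 'ß')
    return html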

Before writing my string to a txt file, I want to replace the letters so that the text file has the right spelling (Klaus-Groth-Straße, Buchholz in der Nordheide).

But it's not working :/

Could you tell me which kind of string my 3 examples belong to - Unicode or ASCII or UTF-8? And how do I write the right spelling to a txt by using a string like #1?

Xyz
user2195049

2 Answers

You're correct, example #3 is a Unicode string because of the leading u. That's probably the easiest to deal with.
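To see which kind of string you are holding, you can inspect its type. The sketch below assumes Python 3, where byte strings need an explicit b prefix and every other string literal is Unicode (in Python 2, as used in this question, the types are str and unicode instead). The helper name kind_of is made up for illustration.

```python
def kind_of(value):
    """Return a rough label for a string-like value (Python 3 semantics)."""
    if isinstance(value, bytes):
        try:
            value.decode('ascii')
            return 'bytes (pure ASCII)'
        except UnicodeDecodeError:
            # Bytes alone do not say which encoding produced them.
            return 'bytes (non-ASCII, encoding unknown)'
    if isinstance(value, str):
        return 'text (Unicode)'
    return 'not a string'


print(kind_of(b'plain ASCII'))     # bytes (pure ASCII)
print(kind_of(b'Capit\xe1n\n'))    # bytes (non-ASCII, encoding unknown)
print(kind_of(u'Capit\xe1n\n'))    # text (Unicode)
```

Note that the type only tells you bytes versus text; the encoding of a byte string has to be known or guessed from its contents.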

#1 and #2 are both byte strings. #1 consists entirely of ASCII characters, so you won't get any Unicode errors from it; however, it contains an HTML entity that you probably want converted to a character. There are various strategies for converting HTML entities; see the question "Decoding HTML entities with Python". The result should be a Unicode string.
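As one possible strategy, here is a minimal sketch that assumes Python 3.4 or later, where the standard library offers html.unescape (the linked question covers Python 2 alternatives). It converts all named and numeric entities in one call, so a hand-written replace table is not needed.

```python
import html

raw = 'Klaus-Groth-Stra&szlig;e, Ballahausen'
text = html.unescape(raw)  # '&szlig;' becomes the single character U+00DF

# ascii() escapes non-ASCII characters, so the output does not depend
# on the terminal's encoding.
print(ascii(text))  # 'Klaus-Groth-Stra\xdfe, Ballahausen'
```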

#2 contains a byte that isn't ASCII, but the string isn't valid UTF-8 either. In UTF-8 every non-ASCII character takes at least 2 bytes, but you have only one (\xe1). This means the string is in some other character encoding and needs to be decoded before you work with it. The Windows 1252 code page is probably a good guess.

>>> 'Capit\xe1n\n'.decode('cp1252')
u'Capit\xe1n\n'
>>> print 'Capit\xe1n\n'.decode('cp1252')
Capitán
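The reasoning about the single byte can be checked directly. This is a sketch assuming Python 3, so example #2 is written with a b prefix; try_decode is a hypothetical helper name.

```python
raw = b'Capit\xe1n\n'  # example #2 as a Python 3 bytes literal


def try_decode(data, encoding):
    """Return the decoded text, or None if data is not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE1 announces a multi-byte UTF-8 sequence, but the next byte is a plain 'n'.
print(try_decode(raw, 'utf-8') is None)             # True
# In cp1252 the single byte 0xE1 is the letter a with an acute accent.
print(try_decode(raw, 'cp1252') == u'Capit\xe1n\n')  # True
# The same letter encoded as UTF-8 takes two bytes.
print(u'\xe1'.encode('utf-8'))                       # b'\xc3\xa1'
```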

When you write back out to a file, you'll want to convert Unicode strings back to byte strings. Do that with the encode method on the string. You'll need to decide what encoding you want your file to be in.

f.write(u'Capit\xe1n\n'.encode('utf-8'))

or

f.write(u'Capit\xe1n\n'.encode('cp1252'))
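An alternative to calling encode yourself is to let the file object do it. The sketch below uses io.open, which accepts an encoding argument in both Python 2.6+ and Python 3; the temporary path is a placeholder for illustration only.

```python
import io
import os
import tempfile

text = u'Capit\xe1n\n'
path = os.path.join(tempfile.mkdtemp(), 'out.txt')  # placeholder location

# The file object encodes Unicode text to UTF-8 bytes on write.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading the file back in binary mode shows the bytes actually stored.
with io.open(path, 'rb') as f:
    data = f.read()

print(b'\xc3\xa1' in data)  # True
```

With this approach the encoding is chosen once, when the file is opened, rather than at every write call.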
Community
Mark Ransom

While I don't know exactly what is wrong in your case, I ran into a similar problem myself and have since solved it. I use Delphi9, and my problem occurred when reading UTF8 from a file and then writing it back again. To make a long story short, various accents, graves, and similar marks simply vanished from the letters when writing. Either the tools for encoding and decoding UTF8 don't fully do the job, or Delphi itself does some hidden work in the background.

I ended up writing my own UTF8 decoder and encoder, and now everything works flawlessly. The UTF8 scheme is actually quite simple: a little bit of bit-shifting and adding, and you are there for both decoding and encoding. I used https://www.rfc-editor.org/rfc/rfc3629 as a reference for my work.

At least it gives you a perfect explanation of the UTF8 standard.
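To illustrate the bit-shifting, here is a rough Python 3 sketch of the encoding half, following the bit layout described in RFC 3629. It is not the Delphi code mentioned above, which is not shown; the function name is made up, and in Python the built-in encode method already does this work.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError('not a valid Unicode scalar value: %r' % (cp,))
    if cp < 0x80:
        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode_code_point(0xE1))  # b'\xc3\xa1'
```

Decoding reverses the process: the leading byte gives the sequence length, and the low six bits of each continuation byte are shifted back into place.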

Community