0

I have this text in a file - Recuérdame (notice it's a French word). When I read this file with a Python script, I get this text as Recu&#xE9;rdame.

I read it as a unicode string. Do I need to find out what the encoding of the text is and decode it, or is my terminal playing tricks on me?

Srikar Appalaraju
  • effbot might be able to help you.. http://effbot.org/zone/unicode-objects.htm – William Dec 16 '10 at 06:37
  • 1
    possible duplicate of [Convert XML/HTML Entities into Unicode String in Python](http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python) – Josh Lee Dec 16 '10 at 06:37
  • 7
    Actually, I think [this is Spanish](http://translate.google.com/#auto|en|Recu%C3%A9rdame%20) (never heard this in French, anyway). – Cameron Dec 16 '10 at 06:55

4 Answers

5

Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).

For example, if you know the encoding is UTF-8:

with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')   # -sig takes care of BOM if present

The text in your file seems not to be encoded Unicode, however; the accented character is apparently stored as an XML entity, which will have to be converted manually (tip of the hat to jleedev for the link).
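The two steps described here (decode the bytes, then convert the entity) can be sketched together. This is a minimal sketch, assuming Python 3.4 or later, where the standard library offers `html.unescape`; the helper name `read_text` and the sample bytes are hypothetical and stand in for the contents of the real file.

```python
import html

def read_text(raw_bytes, encoding="utf-8-sig"):
    """Decode raw file bytes, then expand HTML/XML character references."""
    text = raw_bytes.decode(encoding)   # bytes -> str; -sig drops a leading BOM
    return html.unescape(text)          # e.g. "&#xE9;" becomes "é"

# Stand-in for f.read() on a file opened in 'rb' mode (hypothetical sample).
sample = b"Recu&#xE9;rdame"
print(read_text(sample))
```

The encoding still has to be known or guessed for the first step; the entity conversion only runs on text that has already been decoded.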

Cameron
  • what's BOM (in context of -sig) ? – Srikar Appalaraju Dec 16 '10 at 06:46
  • 2
    @MovieYoda: Ah, check out [this article](http://en.wikipedia.org/wiki/Byte-order_mark). Basically, when it takes multiple bytes together to represent a single character (as can be the case with UTF-8), those bytes could be interpreted in a different order than intended (this order is called endianness). Because of this, a special unambiguous (and optional, in the case of UTF-8) mark is placed at the beginning of the file to indicate the endianness of the file. `-sig` removes the BOM if it's present so you don't get the marker appearing as part of your unicode string. – Cameron Dec 16 '10 at 06:51
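To see what `-sig` changes in practice, here is a small Python 3 sketch comparing the `utf-8` and `utf-8-sig` codecs on the same bytes. The sample text is an assumption chosen for illustration. Note that UTF-8 itself has a fixed byte order, so in a UTF-8 file the mark acts as a signature rather than as an endianness indicator.

```python
import codecs

payload = "Recu\u00e9rdame".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload       # b'\xef\xbb\xbf' followed by the text

plain = with_bom.decode("utf-8")           # BOM survives as the character U+FEFF
stripped = with_bom.decode("utf-8-sig")    # BOM is removed while decoding
no_bom = payload.decode("utf-8-sig")       # no BOM present: decodes normally

print(repr(plain[0]))
print(stripped == no_bom)
```

Because `utf-8-sig` is harmless when no BOM is present, it is a reasonable default when reading UTF-8 files of unknown origin.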
1

It is not a Unicode string. It's a string in whatever encoding the file uses, so it's a UTF-8 string, a Latin-1 string, or something else. In this case, &#xE9; is an HTML/XML entity representing é specifically. It's an escape mechanism used in HTML and XML to represent non-ASCII data.

To decode that into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
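For readers who cannot follow the link, the general idea can be sketched as a regex substitution that handles hexadecimal, decimal, and named references. This is an independent Python 3 sketch, not a copy of the linked code; the function name `unescape` and the pattern are assumptions made for illustration, and `html.entities.name2codepoint` is the standard-library table of named entities.

```python
import re
from html.entities import name2codepoint

# Matches &#xHH;, &#NNN; and &name; style references.
_ENTITY = re.compile(r"&(#[xX]?[0-9A-Fa-f]+|\w+);")

def unescape(text):
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))    # hexadecimal code point
            if body.startswith("#"):
                return chr(int(body[1:], 10))    # decimal code point
            return chr(name2codepoint[body])     # named entity, e.g. eacute
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                # leave unknown references as-is
    return _ENTITY.sub(replace, text)

print(unescape("Recu&#xE9;rdame"))
```

Unknown or malformed references are left untouched rather than raising, which is usually the safer behaviour for text of uncertain origin.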

Lennart Regebro
  • Yes and no. It represents a numeric code point. You can’t say it’s an escaped UTF-8 character. It may be a Unicode character, but that’s something different. – tchrist Dec 16 '10 at 07:12
  • 1
    Sure, all characters that exist in the set of Unicode characters are Unicode characters, of course. But with that definition, anything that can be decoded into Unicode is a Unicode string, including ASCII strings, and then the term "Unicode string" loses all meaning. A Unicode string is a string of Unicode data, and in Python, that's something held in a Unicode object. Anything that is encoded should *not* be called a Unicode string, it just makes people confused. – Lennart Regebro Dec 16 '10 at 07:15
0

Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it into ...xl_data.find(cell.value)... gives no error. So I suppose that using strings as arguments in certain commands with xlrd has specific encoding problems.
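A likely explanation is that, in Python 2, calling `str()` on a unicode value implicitly encodes it with the default ASCII codec, which cannot represent ß (U+00DF). The Python 3 sketch below reproduces the same failure with an explicit `.encode("ascii")` and shows that a plain text-to-text search needs no encoding at all. The sample word and sentence are assumptions, and xlrd itself is not involved here.

```python
word = "Gro\u00dfe"               # "Große"; U+00DF sits at index 3

try:
    word.encode("ascii")          # roughly what Python 2's str(u"...") does implicitly
except UnicodeEncodeError as err:
    bad_index = err.start         # index of the character that could not be encoded
    message = str(err)
    print(message)

haystack = "Eine Gro\u00dfe Stadt"
position = haystack.find(word)    # text searched with text: no encoding step
print(position)
```

This matches the observation above: passing the cell value through unchanged avoids the implicit ASCII encoding step.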

user1866080
0

It is HTML, and this construct is called an "entity". You can use

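import re

def entity_decode(match):
    # Groups: the whole entity, "x" if hexadecimal (else empty), the digits.
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return unichr(int(entity, base))  # Python 2; use chr() on Python 3

# Replace every numeric entity of the form &#NNN; or &#xHHH;.
print re.sub("(?i)(&#(x?)([^;]+);)",
       entity_decode,
       "Recu&#xE9;rdame")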

to decode all entities.

Edit: Yes, they are of course not Latin-1; now it should work with all entities.

nils
  • No, there are entities that are not Latin-1, such as Α, a Greek alpha. They are UCS-2, which is two bytes wide and quite tricky to combine with your technique. – Lennart Regebro Dec 16 '10 at 06:53
  • It was a problem with your Latin-1 decoding technique, yes. Now you are using unichr, which works with numeric entities. It still, however, does not work with named entities. And once you add that, your code will be the same as effbot's code, which everyone else links to already. :-) – Lennart Regebro Dec 16 '10 at 07:07