7

I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.

I've tried my best to accomplish this, but have had no luck getting it right. Below is one of the solutions I tried, but it clearly gave the wrong answer.

sentence = '¿Qué tipo es el?' # a str, read with the standard open() method
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False

I've also tried reading the file with codecs.open(.., .., 'latin-1') instead, but that didn't help. Then I tried u'é'.encode('latin-1'), with no luck either.

I'm out of ideas here, any suggestions?

@icktoofay provided the solution. I ended up keeping the Latin-1 decoding of the file, but switched to Python unicode literals for the characters (u'é'), which required declaring the script's encoding at the top of the file. The final step was to normalize both strings with unicodedata.normalize before comparing them. Thank you guys for the prompt and great support.
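The final approach described above can be sketched roughly as follows. The file name and its contents are made up for the demo (the real data came from MySQL output); the demo writes a small Latin-1 file first so the example is self-contained:

```python
# -*- coding: utf-8 -*-
import codecs
import unicodedata

# Stand-in for the real MySQL output file (hypothetical name and contents).
with codecs.open('sentences.txt', 'w', encoding='latin-1') as f:
    f.write(u'¿Qué tipo es el?\n')

# Decode the file as Latin-1, as in the question.
with codecs.open('sentences.txt', encoding='latin-1') as f:
    sentence = f.readline()

# Normalize both sides to the same form before testing membership.
sentence = unicodedata.normalize('NFKD', sentence)
spa_char = unicodedata.normalize('NFKD', u'é')

print(spa_char in sentence)  # True
```

Without the normalize calls, a precomposed u'é' would fail to match a decomposed e + combining accent, which is exactly the mismatch the accepted answer explains.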

dda
  • 6,030
  • 2
  • 25
  • 34
user1411331
  • 71
  • 1
  • 4
  • How do you know the file is in Latin-1, and not UTF-8 or the Windows encoding? – millimoose Nov 10 '12 at 20:24
  • I don't know, Latin-1 was my best guess. How do I check? – user1411331 Nov 10 '12 at 20:26
  • @user1411331: Try decoding with UTF-8. Most likely, if it is UTF-8, it will succeed, whereas if it tries to decode Latin-1 with UTF-8, it will fail. Trying to decode UTF-8 with Latin-1 will not fail, but will give you bad data, e.g., `¿Qué tipo es el?`. – icktoofay Nov 10 '12 at 20:29
  • Use a tool like `od` to see what bytes are in the actual file. If the file is UTF-8, `'é'` is encoded using more than one byte. Telling apart CP1252 and Latin-1 is trickier; you'll need to look at their respective specs and find out which character is encoded differently in the two. – millimoose Nov 10 '12 at 20:29
  • My guess is it's not Latin-1, because that encoding was mostly used on Unixen, and most of Linux has transitioned to using UTF-8 throughout. – millimoose Nov 10 '12 at 20:30
  • @icktoofay: Just tried, can't decode with UTF-8, get the following error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 0: invalid start byte – user1411331 Nov 10 '12 at 20:37
  • @user1411331: Then it's not UTF-8. It could be Latin-1, or it could be some other encoding. – icktoofay Nov 10 '12 at 20:38
  • The file is coming from MySQL output, any guess as to what encoding that might be? – user1411331 Nov 10 '12 at 20:39
  • According to http://dev.mysql.com/doc/refman/5.0/en/charset-applications.html, the default character set for MySQL is Latin-1. – icktoofay Nov 10 '12 at 20:45
  • print 'é'.decode('utf-8') in u'Here is my résumé' --> True. Testing for just 'é' will not get you very far... – dda Nov 11 '12 at 01:20

2 Answers

5

Use unicodedata.normalize on the string before checking.

Explanation

Unicode offers multiple ways to represent some characters. For example, á could be a single precomposed character, or two characters: a, followed by a combining mark meaning 'put a ´ on top of that'. Normalizing the string forces it into one representation or the other; which one depends on the form parameter you pass.
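The two representations can be seen directly in a quick sketch: the strings print identically but compare unequal until they are normalized to a common form.

```python
import unicodedata

# Two ways to write á: one precomposed code point, or 'a' plus a
# combining acute accent.
precomposed = u'\u00e1'   # á as a single character
combining = u'a\u0301'    # a + COMBINING ACUTE ACCENT

print(precomposed == combining)                                # False
print(unicodedata.normalize('NFC', combining) == precomposed)  # True
print(unicodedata.normalize('NFD', precomposed) == combining)  # True
```

NFC composes to the single-character form; NFD decomposes to the base letter plus combining mark. Either works for comparison, as long as both sides use the same form.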

icktoofay
  • 126,289
  • 21
  • 250
  • 231
  • 3
    For extra fun, add extra combining marks! `á́́́́́́́́́́́́́́́́́́́` – icktoofay Nov 10 '12 at 20:31
  • 1
    @user1411331: Maybe your Python script is UTF-8 encoded but the data from the database is Latin-1. Try setting `spaChar` like `spaChar = unicodedata.normalize('NFKD', u'á')`. This might make Python need a `# encoding: utf-8` comment at the top of the file. – icktoofay Nov 10 '12 at 21:07
  • Yes, was I not supposed to? I even tried all of the form options, none worked. Here's my code: `sentence = sentence.decode('latin-1')` `sentence = unicodedata.normalize('NFKD', sentence)` `spaChar = 'á'.decode('latin-1')` `spaChar = unicodedata.normalize('NFKD', spaChar)` `print spaChar in sentence` `>>> False` – user1411331 Nov 10 '12 at 21:08
  • Woo hoo, success!! That last comment, which essentially combined utf-8 and latin-1, worked, thanks!!! – user1411331 Nov 10 '12 at 21:10
0

I suspect your terminal is using UTF-8, so 'é'.decode('latin-1') is incorrect. Use a Unicode literal instead: u'é'.

To handle Unicode correctly in a script, declare the script and data file encodings, decode incoming data, and encode outgoing data, using Unicode strings for all text inside the script.

Example (save script in UTF-8):

# coding: utf8
import codecs
with codecs.open('input.txt', encoding='latin-1') as f:
    sentence = f.readline()
if u'é' in sentence:
    print u'Found é'

Note that print implicitly encodes the output in the terminal encoding.

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251