Detecting accents in words (Python)

Question

Here's the dealio: I've written a program that finds all of the anagram classes in the dictionary. However, I'm having a problem dealing with accented characters. Currently my code reads them in and treats them as if they're invisible, but still prints out some sort of replacement code at the end in the form of '\xc3\???'. I'd like to discard all of the words with accents, but I don't know how to detect them.

Things I've tried:

checking if the type is unicode
using a regex to check for words containing '\xc3'
decoding/encoding (I don't understand unicode completely but whatever I tried didn't work).

QUESTION/PROBLEM: I need to find out how to detect accents, but my program prints the accents onto the command line as weird '\xc3\???' characters, which is not how the program treats them, as I haven't been able to find any words containing '\xc3\???' despite that being printed to the command line.

Example: sé -> s\xc3\xa9, and sé and s are considered anagrams by my program.

Test dictionary:

stop
tops
pots
hello
world
pit
tip
\xc3\xa9
sé
s
se

Output of Code:

Found
\xc3\xa9
['pit', 'tip']
['world']
['s\xc3\xa9', 's']
['\\xc3\\xa9']
['stop', 'tops', 'pots']
['se']
['hello']

Program itself:

import re

anadict = {};

for line in open('fakedic.txt'):#/usr/share/dict/words'):
        word = line.strip().lower().replace("'", "")
        line = ''.join(sorted(ch for ch in word if word if ch.isalnum($
        if isinstance(word, unicode):
                print word
                print "UNICODE!"
        pattern = re.compile(r'xc3')
        if pattern.findall(word):
               print 'Found'
               print word
        if anadict.has_key(line):
                if not (word in anadict[line]):
                        anadict[line].append(word)
        else:
                anadict[line] = [word]

for key in anadict:
        if (len(anadict[key]) >= 1):
                print anadict[key]

Help?

I recommend reading this: http://www.joelonsoftware.com/articles/Unicode.html – Will Feb 18 '14 at 05:26

score 1 · Answer 1 · edited May 23 '17 at 11:52

So basically scratch my answer... Just look here:

How to check if a string in Python is in ASCII?

The gist is that you can check every character to see whether its ord() value is less than 128; any character at or above 128 is non-ASCII, which is how you can spot an accented character. Alternatively, you can wrap an ASCII encode or decode in try/except and catch the Unicode error that is raised when the word contains an accented character. (The latter seems to be the more efficient approach.)
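
Both checks can be sketched roughly as follows. This is only an illustration, assuming Python 3 text strings; the function names `is_ascii_ord` and `is_ascii_try` are made up for this example and do not come from the linked question.

```python
def is_ascii_ord(word):
    """Return True if every character's code point is below 128."""
    return all(ord(ch) < 128 for ch in word)


def is_ascii_try(word):
    """Return True if the word can be encoded as ASCII without an error."""
    try:
        word.encode('ascii')
    except UnicodeError:  # parent class of UnicodeEncodeError/UnicodeDecodeError
        return False
    return True


# Hypothetical sample data; 's\u00e9' is the word "sé"
words = ['stop', 's\u00e9', 'se']
plain = [w for w in words if is_ascii_ord(w)]  # keeps only the unaccented words
```

In Python 2, a word read from a file is a byte string, so "sé" arrives as the bytes s\xc3\xa9. The ord() check still applies there, because both of those bytes are above 127.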

This was definitely a learning experience for me as well :) Sorry for taking so long

answered Feb 18 '14 at 04:03 by ForgetfulFellow

While that did allow 'print "sé"' (explicit) to work, it unfortunately did nothing to fix the problem I mentioned. My program still thinks sé -> s\xc3\xa9, and s\xc3\xa9 and s are anagrams. – Worcestershire Feb 18 '14 at 04:08

Ok, I get what you're saying now; I shall look into it – ForgetfulFellow Feb 18 '14 at 04:10
What do you mean your program still 'thinks sé -> s\xc3\xa9' ? – ForgetfulFellow Feb 18 '14 at 04:15
When it takes in the word sé in the dictionary, it returns s\xc3\xa9 as the output word in a list of anagrams. S is apparently an anagram for s\xc3\xa9, and s\xc3\xa9 was originally entered as the word sé, but the program translated it weirdly. – Worcestershire Feb 18 '14 at 04:17
I've updated my answer; hopefully it leads you to finishing your code – ForgetfulFellow Feb 18 '14 at 06:45

score 0 · Accepted Answer · answered Mar 04 '14 at 03:39

I ended up using regular expressions (basically to check for everything which wasn't an alphabetic character) with:

if re.match('^[a-zA-Z_]+$', word):

This let me skip any word that contained a backslash, a digit, an accented character, or any other funky symbol. Not a perfect solution, but it worked.
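
For illustration, here is a rough sketch of how that filter might slot into the anagram grouping loop. It assumes Python 3 and takes the words as a list rather than reading a file; the names `WORD_RE` and `group_anagrams` are hypothetical and not from the original program.

```python
import re
from collections import defaultdict

# Same pattern as above: ASCII letters and underscore only
WORD_RE = re.compile(r'^[a-zA-Z_]+$')


def group_anagrams(lines):
    """Group words by their sorted letters, skipping non-matching words."""
    groups = defaultdict(list)
    for line in lines:
        word = line.strip().lower().replace("'", "")
        if not WORD_RE.match(word):
            continue  # drops accented words, digits, backslashes, empty lines
        key = ''.join(sorted(word))
        if word not in groups[key]:
            groups[key].append(word)
    return groups


# The test dictionary from the question; 's\u00e9' is the word "sé"
test_words = ['stop', 'tops', 'pots', 'hello', 'world', 'pit', 'tip',
              '\\xc3\\xa9', 's\u00e9', 's', 'se']
anagrams = group_anagrams(test_words)
```

With this filter the accented word never reaches the dictionary, so "s" ends up in a group of its own instead of being paired with "sé".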
