How to map Arabic letters to phonemes in Python?

Question

I want to write a simple Python script that maps each Arabic letter to a phoneme symbol. The script reads a file containing a number of words and converts them to phonemes using a dictionary defined in my code.

Content in my .txt file:

السلام عليكم
السلام عليكم و رحمة الله
السلام عليكم و رحمة الله و بركاته
الحمد لله
كيف حالك
كيف الحال

The dictionary in my code:

ar_let_phon_maplist = {u'ﺍ':'A:', u'ﺏ':'B', u'ﺕ':'T', u'ﺙ':'TH', u'ﺝ':'J', u'ﺡ':'H', u'ﺥ':'KH', u'ﻩ':'H', u'ﻉ':'(ayn) ’', u'ﻍ':'GH', u'ﻑ':'F', u'ﻕ':'q', u'ﺹ':u'ṣ', u'ﺽ':u'ḍ', u'ﺩ':'D', u'ﺫ':'DH', u'ﻁ':u'ṭ', u'ﻙ':'K', u'ﻡ':'M', u'ﻥ':'N', u'ﻝ':'L', u'ﻱ':'Y', u'ﺱ':'S', u'ﺵ':'SH', u'ﻅ':u'ẓ', u'ﺯ':'Z', u'ﻭ':'W', u'ﺭ':'R'}

I have a nested loop where I'm reading each line, converting each character:

with codecs.open(sys.argv[1], 'r', encoding='utf-8') as file:
        lines = file.readlines()

line_counter = 0

for line in lines:
        print "Phonetics In Line " + str(line_counter)
        print line + " ",
        for word in line:
                for character in word:
                        if character == '\n':
                                print ""
                        elif character == ' ':
                                print "  "
                        else:
                                print ar_let_phon_maplist[character] + " ",
line_counter +=1

And this is the error I'm getting:

Phonetics In Line 0
السلام عليكم

Traceback (most recent call last):
  File "grapheme2phoneme.py", line 25, in <module>
    print ar_let_phon_maplist[character] + " ",
KeyError: u'\u0627'

And then I checked if the file type is UTF-8 using the Linux command:

file words.txt

The output I got:

words.txt: UTF-8 Unicode text

Is there a solution to this problem? Why isn't the character matching the Unicode key in the dictionary, given that the character I use as the key in the ar_let_phon_maplist[character] line is also Unicode? Is there something wrong with my code?

score 4 · Accepted Answer · edited May 23 '17 at 11:44

The first thing that catches the eye is the KeyError. Your dictionary simply does not know about some of the symbols encountered in the file. Looking ahead, it does not know about ANY of the characters in the file, not only the first.

What can we do about it? We could just add all of the symbols from the Arabic blocks of the Unicode table to our dictionary. Simple? Yes. Clear? No.

If you want to actually understand the reasons for this 'strange' behaviour, you need to know more about Unicode. In short, there are a lot of letters that look similar but have different code points (ordinal numbers). Moreover, the same letter can sometimes be represented in multiple forms. So comparing Unicode characters is not a trivial task.
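To make this concrete, here is a minimal sketch (Python 3, standard library only) comparing the alef used as a key in the question's dictionary with the alef reported in the KeyError. The variable names are illustrative and not part of the original script.

```python
import unicodedata

# Key as written in the question's dictionary (a presentation form) ...
isolated_alef = '\ufe8d'
# ... and the character actually read from the file (from the KeyError).
plain_alef = '\u0627'

# Print the code point and the official Unicode name of each character.
for ch in (isolated_alef, plain_alef):
    print('U+{0:04X} {1}'.format(ord(ch), unicodedata.name(ch)))

# The two characters compare unequal, so a dict keyed on one
# will not find the other.
print(isolated_alef == plain_alef)

# NFKD compatibility normalization folds the presentation form
# into the base letter, after which the comparison succeeds.
print(unicodedata.normalize('NFKD', isolated_alef) == plain_alef)
```

The last two lines should print False and True respectively, which is why normalizing both the dictionary keys and the input characters makes the lookup work.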

So, if I were allowed to use Python 3.3+, I would solve the task as follows. First, I would normalize the keys of the ar_let_phon_maplist dictionary:

import unicodedata
# Fold presentation forms (e.g. U+FE8D) into their base letters (e.g. U+0627).
ar_let_phon_maplist = {unicodedata.normalize('NFKD', k): v
                       for k, v in ar_let_phon_maplist.items()}

Then we iterate over the lines in the file, the words in each line, and the characters in each word, like this:

for index, line in enumerate(lines):
    print('Phonetics in line {0}, total {1} symbols'.format(index, len(line)))
    unknown = []  # Here will be stored symbols that we haven't found in dict
    words = line.split()
    for word in words:
        print(word, ': ', sep='', end='')
        for character in word:
            c = unicodedata.normalize('NFKD', character).casefold()
            try:                
                print(ar_let_phon_maplist[c], sep='', end='')
            except KeyError:
                print('_', sep='', end='')
                if c not in unknown:
                    unknown.append(c)
        print()
    if unknown:
        print('Unrecognized symbols: {0}, total {1} symbols'.format(', '.join(unknown), 
                                                                    len(unknown)))

The script will produce output like this:

Phonetics in line 4, total 9 symbols
كيف: KYF
حالك: HA:LK
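The same approach can also be packaged as a small, self-contained function. The following is only a sketch under some assumptions: raw_map is a shortened stand-in for the question's dictionary, the helper names isolated and to_phonemes are made up for illustration, and the whole word is normalized at once instead of character by character (casefold is omitted here because Arabic letters have no case).

```python
import unicodedata


def isolated(name):
    # Look up an isolated presentation-form letter by its Unicode name.
    return unicodedata.lookup('ARABIC LETTER {0} ISOLATED FORM'.format(name))


# Shortened stand-in for the question's dictionary (keys are presentation forms).
raw_map = {isolated('ALEF'): 'A:', isolated('KAF'): 'K', isolated('YEH'): 'Y',
           isolated('FEH'): 'F', isolated('HAH'): 'H', isolated('LAM'): 'L'}

# Normalize the keys once, so they become the base letters found in ordinary text.
phoneme_map = {unicodedata.normalize('NFKD', k): v for k, v in raw_map.items()}


def to_phonemes(word, mapping, placeholder='_'):
    """Return the phoneme string for one word; unknown characters become placeholder."""
    normalized = unicodedata.normalize('NFKD', word)
    return ''.join(mapping.get(character, placeholder) for character in normalized)


line = '\u0643\u064a\u0641 \u062d\u0627\u0644\u0643'  # the line "كيف حالك" from the file
for word in line.split():
    print(word, to_phonemes(word, phoneme_map), sep=': ')
```

Using mapping.get with a placeholder plays the same role as the try/except KeyError above, so an unmapped character does not stop the conversion.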

score 1 · Answer 2 · answered Dec 31 '15 at 00:42

It looks like you forgot that character in the dictionary. You have ﺍ (u'\ufe8d', ARABIC LETTER ALEF ISOLATED FORM), which looks similar, but you don't have ا (u'\u0627', ARABIC LETTER ALEF).
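One way to find such mismatches yourself is to list every character of the input that is absent from the dictionary, together with its code point and Unicode name. This is a rough sketch; the function name missing_characters and the one-entry mapping are hypothetical and only serve the illustration.

```python
import unicodedata


def missing_characters(text, mapping):
    """List 'U+XXXX NAME' for each distinct non-space character of text absent from mapping."""
    missing = []
    for ch in text:
        if ch.isspace() or ch in mapping:
            continue
        label = 'U+{0:04X} {1}'.format(ord(ch), unicodedata.name(ch, '<unnamed>'))
        if label not in missing:
            missing.append(label)
    return missing


# Only the isolated-form alef is a key, as in the question's dictionary.
mapping = {'\ufe8d': 'A:'}

# The plain alef from the file is reported as missing.
print(missing_characters('\u0627', mapping))
```

Running this against the whole words file would show at a glance which code points the dictionary needs to cover.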

KSFT

I think you are right, but how can I convert the Unicode isolated form to the normal Unicode character? – 0x01Brain Dec 31 '15 at 01:10
@0x01Brain I wouldn't call it a different "form"; it's just another character. I'd just put two entries in the dictionary. By the way, if my answer helped you, feel free to vote it up. – KSFT Dec 31 '15 at 01:37
So why do they have different hex values? – 0x01Brain Dec 31 '15 at 01:43
@0x01Brain It's a completely different character. – KSFT Dec 31 '15 at 02:47
