
I've been looking for a simple way to convert a number from a Unicode string to an ASCII string in Python. For example, the input:

input = u'\u0663\u0669\u0668\u066b\u0664\u0667'

should yield '398.47'.

I started with:

NUMERALS_TRANSLATION_TABLE = {0x660:ord("0"), 0x661:ord("1"), 0x662:ord("2"), 0x663:ord("3"), 0x664:ord("4"), 0x665:ord("5"), 0x666:ord("6"), 0x667:ord("7"), 0x668:ord("8"), 0x669:ord("9"), 0x66b:ord(".")}
input.translate(NUMERALS_TRANSLATION_TABLE)

This solution worked, but I want to support all number-related characters in Unicode, not just the Arabic ones. I can translate the digits by iterating over the Unicode string and calling unicodedata.digit(input[i]) on each character. I don't like this solution because it doesn't handle '\u066b' (the Arabic decimal separator) or '\u2013' (an en dash used as a minus sign). I could handle these by using translate as a fallback, but I'm not sure whether there are other such characters that I'm not yet aware of, so I'm looking for a better, more elegant solution.
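
For reference, the per-character approach described above might look roughly like the sketch below. It is written for Python 3 (an assumption; the rest of this thread uses Python 2), and the names `FALLBACK` and `to_ascii_number` are hypothetical, chosen only for illustration. Treating the en dash as a minus sign is likewise an assumption specific to this data.

```python
import unicodedata

# Hypothetical manual fallback for characters that have no digit value.
FALLBACK = {
    '\u066b': '.',  # ARABIC DECIMAL SEPARATOR
    '\u2013': '-',  # EN DASH, treated here as a minus sign (an assumption)
}

def to_ascii_number(text):
    """Replace each character that has a Unicode digit value with an ASCII digit."""
    out = []
    for ch in text:
        digit = unicodedata.digit(ch, None)  # None when ch has no digit value
        if digit is not None:
            out.append(str(digit))
        else:
            # Use the manual table; characters not listed pass through unchanged.
            out.append(FALLBACK.get(ch, ch))
    return ''.join(out)

print(to_ascii_number('\u0663\u0669\u0668\u066b\u0664\u0667'))  # 398.47
```

The weakness is visible in the sketch: everything outside the digits depends on what has been entered into the fallback table by hand.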

Any suggestions would be greatly appreciated.

PaF

2 Answers


Using unicodedata.digit() to look up the digit values for 'numeric' codepoints is the correct method:

>>> import unicodedata
>>> unicodedata.digit(u'\u0663')
3

This uses the Unicode Character Database to look up the digit value for a given codepoint.

You could build a translation table by using unicode.isdigit() (str.isdigit() in Python 3) to test for digits; this is true for all codepoints for which the standard defines a digit value (Numeric_Type Digit or Decimal). For decimal points, you could look for DECIMAL SEPARATOR in the character name; the standard doesn't track these separately by any other metric:

NUMERALS_TRANSLATION_TABLE = {
    i: unicode(unicodedata.digit(unichr(i)))
    for i in range(2 ** 16) if unichr(i).isdigit()}
NUMERALS_TRANSLATION_TABLE.update(
    (i, u'.') for i in range(2 ** 16)
    if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))

That produces a table of 447 entries, including 2 decimal separators: U+066B ARABIC DECIMAL SEPARATOR and U+2396 DECIMAL SEPARATOR KEY SYMBOL. The latter is really just a made-up symbol for the decimal separator key on a numeric keypad, for manufacturers that don't want to commit themselves to printing a , or . on that key.

Demo:

>>> import unicodedata
>>> NUMERALS_TRANSLATION_TABLE = {
...     i: unicode(unicodedata.digit(unichr(i)))
...     for i in range(2 ** 16) if unichr(i).isdigit()}
>>> NUMERALS_TRANSLATION_TABLE.update(
...     (i, u'.') for i in range(2 ** 16)
...     if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))
>>> input = u'\u0663\u0669\u0668\u066b\u0664\u0667'
>>> input.translate(NUMERALS_TRANSLATION_TABLE)
u'398.47'
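
On Python 3, where `unicode` and `unichr` no longer exist, the same table could be built along these lines. This is only a sketch under a few assumptions: the function name `build_numerals_table` is hypothetical, the scan covers every codepoint up to `sys.maxunicode` rather than stopping at 2 ** 16, and the number of entries depends on the Unicode version bundled with the interpreter, so it will not necessarily match the count quoted above.

```python
import sys
import unicodedata

def build_numerals_table():
    """Map digit codepoints to ASCII digits and decimal separators to '.'."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.isdigit():
            # isdigit() is true exactly when the character has a digit value.
            table[cp] = str(unicodedata.digit(ch))
        elif 'DECIMAL SEPARATOR' in unicodedata.name(ch, ''):
            # Unnamed codepoints yield '' and are skipped.
            table[cp] = '.'
    return table

NUMERALS_TRANSLATION_TABLE = build_numerals_table()
print('\u0663\u0669\u0668\u066b\u0664\u0667'.translate(NUMERALS_TRANSLATION_TABLE))  # 398.47
```

Scanning the full range takes noticeably longer than scanning 2 ** 16 codepoints, so the table is best built once and reused.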
Martijn Pieters
  • Hmm, I keep getting "'str' object has no attribute 'isdecimal'". Python 2.7 – chrisaycock Aug 14 '14 at 17:22
  • @chrisaycock: ah, of course, Python 2.7, not 3.x. Will adjust. – Martijn Pieters Aug 14 '14 at 17:22
  • Sorry for downvoting you, but your answer doesn't add anything to what I already submitted in my original question. Moreover, it doesn't solve the main issue I raised, which is other characters used in number representations such as `'\u066b'` and `'\u2013'` that I might not be aware of at the moment. – PaF Aug 14 '14 at 17:49
  • @PaF: The interpretation of `\u066b` and `\u2013` is not clearly defined in the unicode data available from Python, so there is no way to automate those. For the digits, using `str.isdigit()` is the best method to build the table. I'm sorry that you feel that my answer doesn't add anything though; I've edited it a little to clarify what I am trying to say. – Martijn Pieters Aug 14 '14 at 17:53
  • @PaF: There is no indication *in the standard* that 2013 is to be recognised as part of a numeric value; if you wanted to define all dashes as minus signs, you'd need to add additional mapping. For what it is worth, I added all decimal separators; there are just 2. – Martijn Pieters Aug 14 '14 at 18:03
  • @PaF: looking at [Wikipedia on the Decimal mark](http://en.wikipedia.org/wiki/Decimal_mark) the Persians use a forward slash as the decimal separator (calling it the *momayyez*); you'd have to map that entirely manually as there is no indication in the standard that it has any numeric function to play. My answer is stretching what the Unicode standard can help you with as it is. – Martijn Pieters Aug 14 '14 at 18:16
  • Thanks. I upvoted and accepted your answer. I was originally looking for a definitive solution, encompassing all possibilities I might encounter; the `'\u2013'` is an actual "minus" sign I encountered in the localization of the data that I'm required to parse. I now understand that this request is rather far-fetched, and I'll have to build my own DB for my use-case. You gave me a strong starting point though, and I greatly appreciate it. – PaF Aug 15 '14 at 21:31
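
The additional mapping for dashes mentioned in the comments could be sketched as follows, in Python 3. Treating every character in the Unicode category Pd (dash punctuation) as a minus sign is an assumption that will not suit all data, and the name `build_minus_table` is hypothetical.

```python
import sys
import unicodedata

def build_minus_table():
    """Map every dash-like codepoint to the ASCII hyphen-minus."""
    # Assumption: all dash punctuation (category Pd) stands for a minus sign.
    table = {
        cp: '-' for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == 'Pd'
    }
    table[0x2212] = '-'  # MINUS SIGN is a math symbol (category Sm), not Pd
    return table

MINUS_TABLE = build_minus_table()
print('\u20135'.translate(MINUS_TABLE))  # -5
```

A table like this can be merged into a digit translation table with `dict.update()`. Separators such as the Persian forward slash still need explicit, data-specific entries.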
>>> from unidecode import unidecode
>>> unidecode(u'\u0663\u0669\u0668\u066b\u0664\u0667')
'398.47'
idwaker
  • Unidecode doesn't have nearly the coverage of using `unicodedata.digit()`; try 0x1090 through to 0x1099 for example, or 0xa8d0-0xa8d9, 0xa900-0xa909, 0x1946-0x194f, etc. – Martijn Pieters Aug 14 '14 at 17:33
  • On the other hand, 0x2474-0x247c are nicely represented by putting the digits in parentheses; 0x2474, `⑴` is represented as `(1)` by unidecode. Ditto for 0x2488-0x24fd, adding a dot after the number. – Martijn Pieters Aug 14 '14 at 17:35
  • thanks, probably might have broke something with unidecode :) – idwaker Aug 14 '14 at 17:56
  • No, `unidecode` is a great effort but it is not exhaustive. – Martijn Pieters Aug 14 '14 at 17:56