Convert Arabic Characters (Eastern Arabic Numerals) to Arabic Numerals in Python

Question

Some of our clients submit timestamps like ٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣, which Google Translate renders as "03/10/2015 19:01:43".

How can I achieve the same in Python?

Python includes code to parse dates, have you tried that? Further, Python has code to parse numbers, have you tried that? If not, you could still write that code yourself, the Arabian number literals are properly documented in the Unicode standard. — Ulrich Eckhardt, Oct 08 '15 at 05:22
I tried `from dateutil.parser import parse; parse('٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣ ')` which throws a ValueError and I had a look at this SO answer already: http://stackoverflow.com/a/5879577/1252307. Your comment made me look again a bit closer. Thanks! :-) — kev, Oct 08 '15 at 06:06
@user1252307, I just realized you answered your own question (which I didn't notice until long after providing my own answer). If you like your answer best, it is appropriate to accept it. You won't get (or give) points, but it will indicate you have an acceptable answer, which helps other users. — jimhark, Oct 13 '15 at 17:51
Just FYI, the number representations you refer to are called [Eastern Arabic numerals](https://en.wikipedia.org/wiki/Eastern_Arabic_numerals). — Burhan Khalid, Oct 13 '15 at 21:57
[`unicodedata.digit()`-based solution is wrong](http://stackoverflow.com/questions/33004571/convert-arabic-characters-eastern-arabic-numerals-to-arabic-numerals-in-python#comment54218009_33011662) — jfs, Oct 19 '15 at 04:44
Would have thought this would be an easier one .. thanks for all the enlightenment @J.F.Sebastian ! — kev, Oct 19 '15 at 06:16
@J.F.Sebastian, thanks for your comment. I've improved my answer below based on the problem you pointed out. (Though I think calling my original answer 'wrong' is harsh given the context of the question.) — jimhark, Oct 20 '15 at 01:56

score 3 · Answer 1 · answered Jun 19 '16 at 11:49

There is also the unidecode library from https://pypi.python.org/pypi/Unidecode.

In Python 2:

>>> from unidecode import unidecode
>>> unidecode(u"۰۱۲۳۴۵۶۷۸۹")
'0123456789'

In Python 3:

>>> from unidecode import unidecode
>>> unidecode("۰۱۲۳۴۵۶۷۸۹")
'0123456789'

score 2 · Answer 2 · edited May 23 '17 at 10:29

My solution fails for a different timestamp: u'۲۰۱۵-۱۰-۱۸ ۰۸:۲۲:۱۱'. Go for J.F. Sebastian's or jimhark's solution.

Use ord to get the Unicode code point of each character. The Arabic-Indic digits start at code point 1632 (the digit 0).

d = u'٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣'

s = []
for c in d:
    o = ord(c)
    print '%s -> %s, %s - 1632 = %s' %(c, o, o, o - 1632)
    if 1631 < o < 1642:
        s.append(str(o - 1632))
        continue
    s.append(c)   
print ''.join(s)

#or as a one liner:
print ''.join([str(ord(c)-1632) if 1631 < ord(c) < 1642 else c for c in d])

Here is the output of the for loop:

٢ -> 1634, 1634 - 1632 = 2
٠ -> 1632, 1632 - 1632 = 0
١ -> 1633, 1633 - 1632 = 1
٥ -> 1637, 1637 - 1632 = 5
- -> 45, 45 - 1632 = -1587
١ -> 1633, 1633 - 1632 = 1
٠ -> 1632, 1632 - 1632 = 0
- -> 45, 45 - 1632 = -1587
٠ -> 1632, 1632 - 1632 = 0
٣ -> 1635, 1635 - 1632 = 3
  -> 32, 32 - 1632 = -1600
١ -> 1633, 1633 - 1632 = 1
٩ -> 1641, 1641 - 1632 = 9
: -> 58, 58 - 1632 = -1574
٠ -> 1632, 1632 - 1632 = 0
١ -> 1633, 1633 - 1632 = 1
: -> 58, 58 - 1632 = -1574
٤ -> 1636, 1636 - 1632 = 4
٣ -> 1635, 1635 - 1632 = 3
2015-10-03 19:01:43
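
The fixed offset of 1632 (U+0660) only covers the Arabic-Indic digit block. A timestamp written with the Extended Arabic-Indic (Persian) digits, which start at U+06F0, falls outside the 1632..1641 range, which is presumably why the loop above leaves the other timestamp untouched. A minimal Python 3 sketch that handles both blocks with a translation table follows; the names `DIGIT_TABLE` and `to_ascii_digits` are made up for illustration and are not part of the original answer.

```python
# Sketch only: map two digit blocks to ASCII with str.translate.
# Arabic-Indic digits start at U+0660, Extended Arabic-Indic at U+06F0.
DIGIT_TABLE = {}
for zero in (0x0660, 0x06F0):
    for value in range(10):
        # Key is the code point, value is the ASCII digit as a string.
        DIGIT_TABLE[zero + value] = str(value)


def to_ascii_digits(text):
    """Return text with digits from the two blocks replaced by 0-9.

    Every other character passes through unchanged.
    """
    return text.translate(DIGIT_TABLE)


# The timestamp from the question, written with escapes (U+066x digits).
arabic = ('\u0662\u0660\u0661\u0665-\u0661\u0660-\u0660\u0663 '
          '\u0661\u0669:\u0660\u0661:\u0664\u0663')
print(to_ascii_digits(arabic))   # 2015-10-03 19:01:43

# A date written with Extended Arabic-Indic digits (U+06Fx).
persian = '\u06f2\u06f0\u06f1\u06f5-\u06f1\u06f0-\u06f1\u06f8'
print(to_ascii_digits(persian))  # 2015-10-18
```

This still covers only the two blocks listed in the table; other scripts' digits would need their own entries or one of the more general approaches in the other answers.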

I've updated my solution to fix the problem J.F. Sebastian pointed out. I have to admit I was unaware of the issue. Switching to unicodedata.decimal seems to fix it. This has been an interesting problem. — jimhark, Oct 20 '15 at 02:01

jfs · Accepted Answer · 2015-10-09T04:58:23.687

2

To convert the time string to a datetime object (Python 3):

>>> import re
>>> from datetime import datetime
>>> datetime(*map(int, re.findall(r'\d+', ' ٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣')))
datetime.datetime(2015, 10, 3, 19, 1, 43)
>>> str(_)
'2015-10-03 19:01:43'

If you need only numbers:

>>> list(map(int, re.findall(r'\d+', ' ٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣')))
[2015, 10, 3, 19, 1, 43]
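
The one-liners above rely on two Python 3 behaviours: `\d` in a `str` pattern matches any Unicode decimal digit, and `int()` accepts such digits. As a hedged sketch, the same idea can be wrapped in a small helper that also checks the number of fields; the name `parse_timestamp` and the six-field check are assumptions for illustration, not part of the answer.

```python
import re
from datetime import datetime


def parse_timestamp(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' style string into a datetime.

    Works for any Unicode decimal digits, because \\d matches them in
    Python 3 str patterns and int() converts them.
    """
    fields = [int(group) for group in re.findall(r'\d+', text)]
    if len(fields) != 6:
        # Assumed policy: reject anything that is not exactly six numbers.
        raise ValueError('expected 6 numeric fields, got %d' % len(fields))
    return datetime(*fields)


# The timestamp from the question, written with escapes.
stamp = ('\u0662\u0660\u0661\u0665-\u0661\u0660-\u0660\u0663 '
         '\u0661\u0669:\u0660\u0661:\u0664\u0663')
print(parse_timestamp(stamp))          # 2015-10-03 19:01:43

# A superscript two is not a decimal digit, so \d does not match it.
print(re.findall(r'\d+', '\u00b2'))    # []
```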

edited Oct 09 '15 at 04:58

answered Oct 08 '15 at 07:43


seems to work for python3, python2.7 gives me an empty list for re.findall(r'\d+', ' ٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣'). – kev Oct 09 '15 at 01:48

@user1252307: On Python 2, you have to use `u''` string literals to get Unicode: `datetime(*map(int, re.findall(ur'\d+', u' ٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣', re.U)))` – jfs Oct 09 '15 at 04:57
aah, thx! I tried u' ٢٠١٥-١٠-٠٣ ١٩:٠١:٤٣' but not ur'\d+' and re.U – kev Oct 09 '15 at 05:10

jimhark · Answer 4 · 2015-10-20T17:15:25.293

While inspired by some of the other answers (thanks @kev), I took a different approach.

(Doh! I just noticed @kev also asked this question.)

You asked specifically about Arabic characters, but it simplifies things to handle all Unicode digits.

Note: I process the same date string, but specify the Unicode characters using Unicode escape sequences because that was easier on my system.

import unicodedata

unicodeDate = u'\u0662\u0660\u0661\u0665-\u0661\u0660-\u0660\u0663 \u0661\u0669:\u0660\u0661:\u0664\u0663'

converted = u''.join([unicode(unicodedata.decimal(c, c)) for c in unicodeDate])
print converted

The second argument to unicodedata.decimal is the default value to return if the first argument doesn't map to a Unicode decimal. The effect of passing in the same character for both arguments is any Unicode decimal is converted to the equivalent ASCII decimal, and all other characters pass through unchanged.
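
The snippet above is Python 2 (`unicode`, `print` statement). A rough Python 3 equivalent of the same pass-through trick might look like this; the function name `normalize_digits` is only illustrative.

```python
import unicodedata


def normalize_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent.

    unicodedata.decimal(ch, ch) returns an int for decimal digits and
    falls back to the character itself otherwise; str() then yields
    either the ASCII digit or the unchanged character.
    """
    return ''.join(str(unicodedata.decimal(ch, ch)) for ch in text)


unicode_date = ('\u0662\u0660\u0661\u0665-\u0661\u0660-\u0660\u0663 '
                '\u0661\u0669:\u0660\u0661:\u0664\u0663')
print(normalize_digits(unicode_date))  # 2015-10-03 19:01:43
```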

My Original Answer

converted = ''.join([str(unicodedata.digit(c, c)) for c in unicodeDate])

@J.F. Sebastian provided a helpful comment pointing out that the code above doesn't properly handle superscripts, for example u'\u00b2'. Also in the same group are the superscripts u'\u00b3' and u'\u00b9'. I found this also affects some code points from:

Apparently unicodedata.digit() tries to pull a digit out of a decorated number, which probably isn't desirable here. But unicodedata.decimal seems like it does exactly what's needed (assuming you don't want to convert decorated digits).
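
To make the difference concrete, here is a small Python 3 sketch comparing the two lookups on an Arabic-Indic digit and on the superscripts mentioned above. The helper name `classify` is hypothetical; `None` is passed as the default so a missing value is easy to see.

```python
import unicodedata


def classify(ch):
    """Return (decimal value, digit value) for ch, None where undefined."""
    return unicodedata.decimal(ch, None), unicodedata.digit(ch, None)


for ch in ('\u0663', '\u00b2', '\u00b3', '\u00b9'):
    print('U+%04X %s -> decimal=%r digit=%r'
          % (ord(ch), unicodedata.name(ch), *classify(ch)))

# U+0663 is a true decimal digit: both lookups return 3.
# The superscripts have a digit value but no decimal value, so a
# decimal()-based conversion leaves them alone while digit() rewrites them.
```

The string methods `str.isdecimal()` and `str.isdigit()` draw the same line: `'\u00b2'.isdecimal()` is false while `'\u00b2'.isdigit()` is true.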

Note: `unicodedata.digit()`-based solution is wrong if you want to convert only numbers in the string e.g., [`u'\u00B2'` is not a decimal number according to Unicode standard (it is a superscript and therefore `u'\u00B2'.isdecimal()` is false)](http://stackoverflow.com/a/3033342/4279). That is why [my solution uses `int()` and `\d` regex](http://stackoverflow.com/a/33009601/4279) that reject non-numbers such as `u'\u00b2'`. — jfs, Oct 19 '15 at 04:42
Wow, good tip. Thanks. Up voted your linked answer. Will update my answer here. (Though I think calling the original answer 'wrong' is a little harsh given the context of the question, but I agree the new behavior is an improvement and 'more right'.) — jimhark, Oct 20 '15 at 01:54
