13

I just learnt from Format numbers as currency in Python that the Python module babel provides babel.numbers.format_currency to format numbers as currency. For instance,

from babel.numbers import format_currency

s = format_currency(123456.789, 'USD', locale='en_US')  # u'$123,456.79'
s = format_currency(123456.789, 'EUR', locale='fr_FR')  # u'123\xa0456,79\xa0\u20ac'

How about the reverse, from currency to numbers, such as $123,456,789.00 --> 123456789? babel provides babel.numbers.parse_number to parse local numbers, but I didn't found something like parse_currency. So, what is the ideal way to parse local currency into numbers?


I went through Python: removing characters except digits from string.

# Way 1
import string
all=string.maketrans('','')
nodigs=all.translate(all, string.digits)

s = '$123,456.79'
n = s.translate(all, nodigs)    # 12345679, lost `.`

# Way 2
import re
n = re.sub("\D", "", s)         # 12345679

It doesn't take care the decimal separator ..


Remove all non-numeric characters, except for ., from a string (refer to here),

import re

# Way 1:
s = '$123,456.79'
n = re.sub("[^0-9|.]", "", s)   # 123456.79

# Way 2:
non_decimal = re.compile(r'[^\d.]+')
s = '$123,456.79'
n = non_decimal.sub('', s)      # 123456.79

It does process the decimal separator ..


But the above solutions don't work when coming to, for instance,

from babel.numbers import format_currency
s = format_currency(123456.789, 'EUR', locale='fr_FR')  # u'123\xa0456,79\xa0\u20ac'
new_s = s.encode('utf-8') # 123 456,79 €

As you can see, the format of currency varies. What is the ideal way to parse currency into numbers in a general way?

Community
  • 1
  • 1
SparkAndShine
  • 17,001
  • 22
  • 90
  • 134

2 Answers2

11

Below is a general currency parser that doesn't rely on the babel library.

import numpy as np
import re

def currency_parser(cur_str):
    # Remove any non-numerical characters
    # except for ',' '.' or '-' (e.g. EUR)
    cur_str = re.sub("[^-0-9.,]", '', cur_str)
    # Remove any 000s separators (either , or .)
    cur_str = re.sub("[.,]", '', cur_str[:-3]) + cur_str[-3:]

    if '.' in list(cur_str[-3:]):
        num = float(cur_str)
    elif ',' in list(cur_str[-3:]):
        num = float(cur_str.replace(',', '.'))
    else:
        num = float(cur_str)

    return np.round(num, 2)

Here is a pytest script that tests the function:

import numpy as np
import pytest
import re


def currency_parser(cur_str):
    # Remove any non-numerical characters
    # except for ',' '.' or '-' (e.g. EUR)
    cur_str = re.sub("[^-0-9.,]", '', cur_str)
    # Remove any 000s separators (either , or .)
    cur_str = re.sub("[.,]", '', cur_str[:-3]) + cur_str[-3:]

    if '.' in list(cur_str[-3:]):
        num = float(cur_str)
    elif ',' in list(cur_str[-3:]):
        num = float(cur_str.replace(',', '.'))
    else:
        num = float(cur_str)

    return np.round(num, 2)


@pytest.mark.parametrize('currency_str, expected', [
    (
            '.3', 0.30
    ),
    (
            '1', 1.00
    ),
    (
            '1.3', 1.30
    ),
    (
            '43,324', 43324.00
    ),
    (
            '3,424', 3424.00
    ),
    (
            '-0.00', 0.00
    ),
    (
            'EUR433,432.53', 433432.53
    ),
    (
            '25.675,26 EUR', 25675.26
    ),
    (
            '2.447,93 EUR', 2447.93
    ),
    (
            '-540,89EUR', -540.89
    ),
    (
            '67.6 EUR', 67.60
    ),
    (
            '30.998,63 CHF', 30998.63
    ),
    (
            '0,00 CHF', 0.00
    ),
    (
            '159.750,00 DKK', 159750.00
    ),
    (
            '£ 2.237,85', 2237.85
    ),
    (
            '£ 2,237.85', 2237.85
    ),
    (
            '-1.876,85 SEK', -1876.85
    ),
    (
            '59294325.3', 59294325.30
    ),
    (
            '8,53 NOK', 8.53
    ),
    (
            '0,09 NOK', 0.09
    ),
    (
            '-.9 CZK', -0.9
    ),
    (
            '35.255,40 PLN', 35255.40
    ),
    (
            '-PLN123.456,78', -123456.78
    ),
    (
            'US$123.456,79', 123456.79
    ),
    (
            '-PLN123.456,78', -123456.78
    ),
    (
            'PLN123.456,79', 123456.79
    ),
    (
            'IDR123.457', 123457
    ),
    (
            'JP¥123.457', 123457
    ),
    (
            '-JP\xc2\xa5123.457', -123457
    ),
    (
            'CN\xc2\xa5123.456,79', 123456.79
    ),
    (
            '-CN\xc2\xa5123.456,78', -123456.78
    ),
])
def test_currency_parse(currency_str, expected):
    assert currency_parser(currency_str) == expected
aj8907
  • 123
  • 1
  • 6
5

Using babel

The babel documentation notes that the number parsing is not fully implemented yes but they have done a lot of work to get currency info into the library. You can use get_currency_name() and get_currency_symbol() to get currency details, and also all other get_... functions to get the normal number details (decimal point, minus sign, etc.).

Using that information you can exclude from a currency string the currency details (name, sign) and groupings (e.g. , in the US). Then you change the decimal details into the ones used by the C locale (- for minus, and . for the decimal point).

This results in this code (i added an object to keep some of the data, which may come handy in further processing):

import re, os
from babel import numbers as n
from babel.core import default_locale

class AmountInfo(object):
    def __init__(self, name, symbol, value):
        self.name = name
        self.symbol = symbol
        self.value = value

def parse_currency(value, cur):
    decp = n.get_decimal_symbol()
    plus = n.get_plus_sign_symbol()
    minus = n.get_minus_sign_symbol()
    group = n.get_group_symbol()
    name = n.get_currency_name(cur)
    symbol = n.get_currency_symbol(cur)
    remove = [plus, name, symbol, group]
    for token in remove:
        # remove the pieces of information that shall be obvious
        value = re.sub(re.escape(token), '', value)
    # change the minus sign to a LOCALE=C minus
    value = re.sub(re.escape(minus), '-', value)
    # and change the decimal mark to a LOCALE=C decimal point
    value = re.sub(re.escape(decp), '.', value)
    # just in case remove extraneous spaces
    value = re.sub('\s+', '', value)
    return AmountInfo(name, symbol, value)

#cur_loc = os.environ['LC_ALL']
cur_loc = default_locale()
print('locale:', cur_loc)
test = [ (n.format_currency(123456.789, 'USD', locale=cur_loc), 'USD')
       , (n.format_currency(-123456.78, 'PLN', locale=cur_loc), 'PLN')
       , (n.format_currency(123456.789, 'PLN', locale=cur_loc), 'PLN')
       , (n.format_currency(123456.789, 'IDR', locale=cur_loc), 'IDR')
       , (n.format_currency(123456.789, 'JPY', locale=cur_loc), 'JPY')
       , (n.format_currency(-123456.78, 'JPY', locale=cur_loc), 'JPY')
       , (n.format_currency(123456.789, 'CNY', locale=cur_loc), 'CNY')
       , (n.format_currency(-123456.78, 'CNY', locale=cur_loc), 'CNY')
       ]

for v,c in test:
    print('As currency :', c, ':', v.encode('utf-8'))
    info = parse_currency(v, c)
    print('As value    :', c, ':', info.value)
    print('Extra info  :', info.name.encode('utf-8')
                         , info.symbol.encode('utf-8'))

The output looks promising (in US locale):

$ export LC_ALL=en_US
$ ./cur.py
locale: en_US
As currency : USD : b'$123,456.79'
As value    : USD : 123456.79
Extra info  : b'US Dollar' b'$'
As currency : PLN : b'-z\xc5\x82123,456.78'
As value    : PLN : -123456.78
Extra info  : b'Polish Zloty' b'z\xc5\x82'
As currency : PLN : b'z\xc5\x82123,456.79'
As value    : PLN : 123456.79
Extra info  : b'Polish Zloty' b'z\xc5\x82'
As currency : IDR : b'Rp123,457'
As value    : IDR : 123457
Extra info  : b'Indonesian Rupiah' b'Rp'
As currency : JPY : b'\xc2\xa5123,457'
As value    : JPY : 123457
Extra info  : b'Japanese Yen' b'\xc2\xa5'
As currency : JPY : b'-\xc2\xa5123,457'
As value    : JPY : -123457
Extra info  : b'Japanese Yen' b'\xc2\xa5'
As currency : CNY : b'CN\xc2\xa5123,456.79'
As value    : CNY : 123456.79
Extra info  : b'Chinese Yuan' b'CN\xc2\xa5'
As currency : CNY : b'-CN\xc2\xa5123,456.78'
As value    : CNY : -123456.78
Extra info  : b'Chinese Yuan' b'CN\xc2\xa5'

And it still works in different locales (Brazil is notable for using the comma as a decimal mark):

$ export LC_ALL=pt_BR
$ ./cur.py 
locale: pt_BR
As currency : USD : b'US$123.456,79'
As value    : USD : 123456.79
Extra info  : b'D\xc3\xb3lar americano' b'US$'
As currency : PLN : b'-PLN123.456,78'
As value    : PLN : -123456.78
Extra info  : b'Zloti polon\xc3\xaas' b'PLN'
As currency : PLN : b'PLN123.456,79'
As value    : PLN : 123456.79
Extra info  : b'Zloti polon\xc3\xaas' b'PLN'
As currency : IDR : b'IDR123.457'
As value    : IDR : 123457
Extra info  : b'Rupia indon\xc3\xa9sia' b'IDR'
As currency : JPY : b'JP\xc2\xa5123.457'
As value    : JPY : 123457
Extra info  : b'Iene japon\xc3\xaas' b'JP\xc2\xa5'
As currency : JPY : b'-JP\xc2\xa5123.457'
As value    : JPY : -123457
Extra info  : b'Iene japon\xc3\xaas' b'JP\xc2\xa5'
As currency : CNY : b'CN\xc2\xa5123.456,79'
As value    : CNY : 123456.79
Extra info  : b'Yuan chin\xc3\xaas' b'CN\xc2\xa5'
As currency : CNY : b'-CN\xc2\xa5123.456,78'
As value    : CNY : -123456.78
Extra info  : b'Yuan chin\xc3\xaas' b'CN\xc2\xa5'

It is worth to point out that babel has some encoding problems. That is because the locale files (in locale-data) do use different encoding themselves. If you're working with currencies you're familiar with that should not be a problem. But if you try unfamiliar currencies you might run into problems (i just learned that Poland uses iso-8859-2, not iso-8859-1).

grochmal
  • 2,901
  • 2
  • 22
  • 28