1

I'm trying to write a regex that will find currency values in my text. The values vary from 2 dollars to 2,240,000,000. I want a single regular expression that finds all of these values, but I'm failing hard. I tried something like:

^\{USD}?(\d*(\d\.?|\.\d{1,2}))$

but it didn't work. I appreciate any help :)

EDIT: For clarification, I have a text with several dollar values in it, ranging from 2 to 2,000,000,000.

The text is something like:

"The base purchase is USD 2,00. (...) The amount equal to US 2,300,000 which refers to the premium package. (...) The country needs USD 300,00..."

I want to find and extract these values (USD + numbers) and save them to a list, each value as a separate element. Thank you

Vanj
  • 2
    Since you're using Python, what would be wrong with just comparing the number directly against your range boundaries, using inequality operators `<` and `>`? If you're starting with a text number, then just cast it first. – Tim Biegeleisen Jan 09 '19 at 13:19
  • 2
    Could you clarify more specifically what you want it to match? In some places, numbers are written like 2.240.000.000,00, for example. Do you want it to match that? – Calvin Godfrey Jan 09 '19 at 13:19
  • 1
    Related: https://stackoverflow.com/questions/3887469/python-how-to-convert-currency-to-decimal – Dani Mesejo Jan 09 '19 at 13:19
  • I edited my question, I appreciate your help – Vanj Jan 09 '19 at 13:33

2 Answers

3

Several things are wrong in your expression: ^\{USD}?(\d*(\d\.?|\.\d{1,2}))$

  1. \{USD}? in regex language means: expect the literal { character, followed by USD, followed by an optional } character. If you want an optional USD group, you have to use parentheses without \: (USD)?. You can use a non-capturing group for this: (?:USD)?.

This would give : ^(USD)?(\d*(\d\.?|\.\d{1,2}))$

  2. (\d\.?|\.\d{1,2}): the whole group should be repeated in order to match the entire string: (\d\.?|\.\d{1,2})*

This would give : ^(USD)?(\d*(\d\.?|\.\d{1,2})*)$

  3. \d\.?: if this is supposed to match the part with a thousands separator, it should be a comma rather than a point, judging by your example: \d,?

This would give : ^(USD)?(\d*(\d,?|\.\d{1,2})*)$

  4. (\d*(\d: this is awkward, because the greedy first \d* consumes all the digits and the second \d can only match after the engine backtracks. You could use the non-greedy operator ? like this: (\d*?(\d but it's not pretty.

This would give : ^(USD)?(\d*?(\d,?|\.\d{1,2})*)$ which may work for you, but looks less than optimal.
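As a rough check of the anchored pattern above, here is a minimal Python sketch. The two sample strings are made up for illustration and are not taken from the question. Because of ^ and $, the pattern can only validate a string that consists of a single value, so it finds nothing when searched against a longer sentence:

```python
import re

# Anchored pattern from the step-by-step fixes above
anchored = re.compile(r"^(USD)?(\d*?(\d,?|\.\d{1,2})*)$")

# Hypothetical sample inputs, not taken from the question
whole_value = "USD2,300,000.50"
sentence = "The amount equal to USD2,300,000.50 is due."

matches_whole = anchored.search(whole_value) is not None
matches_sentence = anchored.search(sentence) is not None

print(matches_whole)     # True
print(matches_sentence)  # False
```

To extract values from running text, the anchors have to go, which is what the alternative construction does.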

An alternative would be to build your regular expression without an "or" clause using the following parts :

  1. The prefix : "USD ", optional and with optional space : (USD ?)?
  2. The integral part of the amount before the thousand separators: \d+
  3. The integral part of the amount with a thousand separator, optional and repeatable: (,\d+)*
  4. The decimal part, optional : (\.\d+)?

Which would give something like this: (USD ?)?(\d+)(,\d+)*(\.\d+)?
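The four parts above can be assembled in Python by plain string concatenation. This is only a sketch: the variable names and the sample sentence are assumptions chosen for illustration.

```python
import re

# One raw string per part described in the list above (names are illustrative)
prefix = r"(USD ?)?"      # optional "USD" with an optional space
leading = r"(\d+)"        # digits before the first thousands separator
thousands = r"(,\d+)*"    # repeatable ",digits" groups
decimals = r"(\.\d+)?"    # optional decimal part

pattern = re.compile(prefix + leading + thousands + decimals)

# Hypothetical sample sentence
match = pattern.search("The country needs USD 2,300,000.50 now")
print(match.group(0))  # USD 2,300,000.50
```

Keeping each part in its own variable makes it easier to tighten one piece (for example, the number of digits) without touching the others.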

You can test it on regex101.com

You can further restrict the number of digits in each part to avoid false positives:

(USD ?)?(\d{1,3})(,\d{3})*(\.\d{1,2})?

A final version would use non-capturing groups wherever capturing is not needed:

(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:\.\d{1,2})?

Edit: the test case you provided uses decimal separators inconsistently (sometimes ".", sometimes ","). If you really want to match that, you can use a character class like this:

(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:[.,]\d{1,2})?

Which matches every number in your example. [Regex 101 screenshot]
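Since the goal is to collect the values in a Python list, here is a minimal sketch using re.findall with the last pattern on the sample text from the question. The variable names are assumptions. Note that the prefix only covers "USD", so the value written with "US" is returned without a prefix:

```python
import re

# Final pattern from above: only non-capturing groups, so findall
# returns the whole matched strings
pattern = r"(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:[.,]\d{1,2})?"

# Sample text quoted in the question
text = ("The base purchase is USD 2,00. (...) The amount equal to "
        "US 2,300,000 which refers to the premium package. (...) "
        "The country needs USD 300,00...")

values = re.findall(pattern, text)
print(values)  # ['USD 2,00', '2,300,000', 'USD 300,00']
```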

zakinster
  • Thank you so much for your answer sir. It is very complete, I really learned from it. One more thing, since the prefix "USD" is an option, can I also edit the regex to find "USD" and/or "INR"? – Vanj Jan 09 '19 at 14:28
  • @Vanj yes, in this case it makes sense to use an alternation like this: `((USD|INR|EUR) ?)?(\d+)(,\d+)*(\.\d+)?` – zakinster Jan 09 '19 at 14:30
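A possible way to use the multi-currency pattern from the comment above in Python. One caveat: with capturing groups, re.findall returns tuples of groups rather than the full matches, so this sketch rewrites the groups as non-capturing. That rewrite and the sample sentence are assumptions, not part of the original comment.

```python
import re

# Same structure as the pattern in the comment, with non-capturing groups
pattern = r"(?:(?:USD|INR|EUR) ?)?\d+(?:,\d+)*(?:\.\d+)?"

# Hypothetical sample sentence
text = "Pay USD 1,200.50, INR 45,000 or EUR 99"

values = re.findall(pattern, text)
print(values)  # ['USD 1,200.50', 'INR 45,000', 'EUR 99']
```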
0

Ok, let's start with

import re
text = "The base purchase is USD 2,00.00 (...) The amount equal to US 2,300,000 which refers to the premium package. (...) The country needs USD 300,00..."

As @zakinster proposed, you can find the number strings that interest you with:

regex = r"(?:USD)?(?:\d+,)*\d+(?:\.\d+)?"
numbers = re.findall(regex, text)

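Then, to keep only the values in the range you mentioned:

def toInteger(s):
    # Drop the decimal part and the thousands separators: '2,300,000' -> 2300000
    return int(s.split('.')[0].replace(',', ''))

def numberBetween(string, lowerBound, upperBound):
    intValue = toInteger(string)
    # Chained comparison: True only if strictly between the two bounds
    return lowerBound < intValue < upperBound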

print(list(filter(lambda x: numberBetween(x,2,2240000000),numbers)))

should give you what you want :

['2,00.00', '2,300,000', '300,00']
Etienne Herlaut