1

I'm scraping some data. One of the data points is tournament prize pools. There are many different currencies in the data. I'd like to extract the amount and currency from each value, so that I can use Google to convert these to a base currency. However, it's been a while since I've used regular expressions, so I'm rusty to say the least. Possible formats of the data are as follows:

$534
$22,136.20
3,200,000 Ft HUF
12,500 kr DKK
50,000 kr SEK
$3,800 AUD
$10,000 NZD
€4,500 EUR
¥100,000 CNY
₹7,000,000 INR
R$39,000 BRL

Below is the first regular expression I came up with.

[0-9,.]+(.+)[A-Z]{3}

But that obviously doesn't capture the amount and currency, so I changed it.

([0-9,.]+).+([A-Z]{3})

However, there are issues with this regular expression that I can't figure out.

  1. ([0-9,.]+) by itself works fine to capture just the amount.

  2. When I add .+ to that expression, for some reason it stops capturing the trailing 4 and 0 in the first and second test cases respectively. Why?

  3. Then when I add ([A-Z]{3}), it seems to work perfectly for all of the test cases, but obviously selects nothing in the first two.

  4. So I changed it to ([A-Z]{0,3}), which seems to break everything.

What's happening? How can I change the expression so that it works?

This is where I'm at: ([0-9,.]+)((?:.+)([A-Z]{3}))?

oldboy
  • Try `([0-9,.]+)(?: [A-Z]{3}| [A-Za-z]+ [A-Z]{3})?` See a [demo](https://regex101.com/r/QuAwJX/1) – The fourth bird Feb 17 '19 at 08:58
  • @Thefourthbird i dont believe that will work since it doesnt capture the currency (i.e. `([A-Z]{3})`) – oldboy Feb 17 '19 at 09:00
  • @Thefourthbird i think i may have figured it out: `([0-9,.]+)((?:.+)([A-Z]{3}))?` however, wont it be tricky accessing the second captured group, the currency? – oldboy Feb 17 '19 at 09:02
  • You mean like this `([0-9,.]+)(?:(?: [A-Za-z]+)? ([A-Z]{3}))?` [demo](https://regex101.com/r/45zTaq/1) – The fourth bird Feb 17 '19 at 09:03

2 Answers

2

This should work:

([0-9,.]+).*?([A-Z]{3})?$

A few changes I made:

  • I changed the .+ to .*? because there isn't always something after the number (as in the first two cases). I used lazy matching because a greedy .* would consume everything up to the end of the line and leave nothing for group 2 to capture.

  • I made group 2 optional with a ? because there isn't always a currency (first 2 cases)

  • I added an end-of-line anchor $ so that the lazy .*? is forced to expand until the currency code (or the end of the line) is reached, instead of matching nothing.

If you don't know what "lazy" means in this context, see this post.

Demo
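As a rough illustration, this is how the pattern above might be applied in Python. The language is an assumption (the question does not name one), and the helper name `parse_prize` and the sample list are made up for this sketch.

```python
import re

# Pattern from the answer above: amount, lazy filler,
# optional 3-letter currency code, end of string.
PRIZE_RE = re.compile(r"([0-9,.]+).*?([A-Z]{3})?$")

def parse_prize(text):
    """Return (amount, currency); currency is None when no code is present."""
    match = PRIZE_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

samples = ["$534", "$22,136.20", "3,200,000 Ft HUF", "$3,800 AUD", "R$39,000 BRL"]
for sample in samples:
    print(sample, "->", parse_prize(sample))
```

The amount is still a string with thousands separators at this point, so it would need cleaning (for example removing the commas) before any currency conversion.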

Sweeper
  • beautiful. is there a way ill be able to know whether or not the 2nd group exists? `try/catch`? – oldboy Feb 17 '19 at 09:06
  • @Anthony `group(2)` should return `None` if the optional group did not take part in the match. So check for `None`. – Sweeper Feb 17 '19 at 09:08
  • perfect. really appreciate it! any suggested reading for understanding the quantifier operators like why `.+` seem to exclude the trailing digits in the online tester that im using? – oldboy Feb 17 '19 at 09:10
  • @Anthony Because `.+` means 1 or more of anything, right? So it has to match at least one character. – Sweeper Feb 17 '19 at 09:12
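The behaviour discussed in these comments can be reproduced with a short check. This is a sketch assuming Python's `re` module; the helper name `amount` is made up for illustration. The greedy `[0-9,.]+` first takes every digit, then has to give one back so that `.+` has a character to match. The last line shows why `([A-Z]{0,3})` appears to break things: zero letters is allowed, so the greedy `.+` runs to the end and the currency group captures an empty string.

```python
import re

def amount(pattern, text):
    """Return what the first capture group holds for the first match."""
    return re.search(pattern, text).group(1)

# The amount group alone keeps every digit.
only_amount = amount(r"([0-9,.]+)", "$534")
# With .+ appended, one character must be left over for .+ to consume.
with_plus = amount(r"([0-9,.]+).+", "$534")
# .* may match nothing, so the amount group keeps all digits.
with_star = amount(r"([0-9,.]+).*", "$534")
# {0,3} permits zero letters, so the currency group ends up empty.
empty_currency = re.search(r"([0-9,.]+).+([A-Z]{0,3})", "12,500 kr DKK").group(2)

print(only_amount, with_plus, with_star, repr(empty_currency))
```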
1

For the example data, you could use an optional non capturing group to match the space and the characters before the currency:

([0-9,.]+)(?:(?: [A-Za-z]+)? ([A-Z]{3}))?

Regex demo

That will match

  • ( Capture group
    • [0-9,.]+ match 1+ times what is listed in the character class
  • ) Close capture group
  • (?: Non capturing group
    • (?: [A-Za-z]+)? Optional group to match a space followed by 1+ characters from a-zA-Z
    • (space)([A-Z]{3}) Match a single space, then capture 3 uppercase chars
  • )? Close non capturing group and make it optional
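
A minimal sketch of how this pattern could be used, assuming Python's `re` module (the question does not state a language); the name `STRICT_RE` and the sample strings are chosen only for illustration. The currency is group 2 and is `None` when the optional part does not match.

```python
import re

# Pattern from this answer: amount, then optionally
# " <letters>" and " <3 uppercase letters>".
STRICT_RE = re.compile(r"([0-9,.]+)(?:(?: [A-Za-z]+)? ([A-Z]{3}))?")

results = {}
for text in ["$22,136.20", "12,500 kr DKK", "€4,500 EUR"]:
    match = STRICT_RE.search(text)
    # group(2) is None when no currency code follows the amount.
    results[text] = (match.group(1), match.group(2))
    print(text, "->", results[text])
```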
The fourth bird
  • that certainly seems to work, although it still wouldnt capture the currency?? Or if it does it would be nested and not as easily accessible?? its also super verbose compared to [Sweeper's answer](https://stackoverflow.com/a/54731634/7543162) – oldboy Feb 17 '19 at 09:08
  • @Anthony It does capture the currency when it is there. It is somewhat more verbose because it is a slightly more precise match for the demo data in the question. It also takes into account that there is a space before the actual currency. – The fourth bird Feb 17 '19 at 09:19
  • yeah but its redundant to check for that space since the currency is the only thing with 3 capital letters. ill still give u a vote for ur effort tho – oldboy Feb 17 '19 at 09:27