I'm trying to find ranges of properly formatted currency or numbers in a string with regular expressions. I happen to be using C#, so the regex is formatted that way.

For example, I want to be able to find:

$10,000,000 to $20M
$10k-$20k
100.23k - 200.34k
$20000 and $500600
3456646 to 4230405

It should not match on:

$10,0000,000 to $20,000,000 //extra zero in first number
20k xyz 40k //middle string does not match a range word

Here is my regular expression so far:

(^|\s|\$)([1-9](?:\d*|(?:\d{0,2})(?:,\d{3})*)(?:\.\d*[1-9])?|0?\.\d*[1-9]|0)(|m|k)(?:|\s)(?:|to|and|-|,)(?:|\s)(|\$)([1-9](?:\d*|(?:\d{0,2})(?:,\d{3})*)(?:\.\d*[1-9])?|0?\.\d*[1-9]|0)(\s|m|k)

It seems to be working fairly well, but sometimes matches items I don't expect it to. Examples:

1985 xyz 1999 //2 matches, both numbers without xyz
$10,000,000 xyz $20000000 //1 match on the $2000000
$10,000,0000 to $20,000,000 //1 match on the $10,000,0000 (extra zero on end)

What am I missing? Is it foolish to do this with regex?

robr
  • My take is that all regexes that don't fit in an 80-character line are too big to either read or debug. I would suggest writing a simple parser once your regex grows larger than the suggested bounds. – Pieter Geerkens Mar 07 '13 at 17:34
  • @Pieter Yeah, I had a feeling it was getting too long myself. It is hard to back down from it once you feel so close though. Maybe I'll try stripping the commas ahead of time and this would simplify it. – robr Mar 07 '13 at 17:49
  • But what is likely to be faster: writing (and researching and testing) the regex, or just writing a simple parser? – Pieter Geerkens Mar 07 '13 at 17:52
  • At this point, regular parsing is probably faster. – robr Mar 07 '13 at 18:39
  • 60 minutes and counting. You would have been 1/3 (at least) through writing a parser by now. – Pieter Geerkens Mar 07 '13 at 18:43
  • @PieterGeerkens Don't underestimate the power of regex. Regex has its limitations, but in a lot of cases, like this one, it will do the trick quite effectively. – Lodewijk Bogaards Mar 08 '13 at 20:31
  • @mrhobo: Don't get me wrong, I love regexes, and use them frequently. But I also know my limits in using them, and that a regex of 60-80 characters long is the limit of my comfort zone, even when well composed as in your answer below. – Pieter Geerkens Mar 08 '13 at 20:59
  • @PieterGeerkens That's good, to know your limits. Just be careful sharing them with other people :) – Lodewijk Bogaards Mar 08 '13 at 21:12

1 Answer

Here you go buddy

(?<=^|\s)\$?\d+((\.\d{2})?(k|M)|(,\d{3})*)\b\s*(to|-|and )\s*\$?\d+((\.\d{2})?(k|M)|(,\d{3})*)(\s|$)

see it in action.

This part

\d+((\.\d{2})?(k|M)|(,\d{3})*)

is repeated, so it is better to store it in a constant and concatenate the pieces into the full regex.

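// The amount sub-pattern appears twice, so define it once.
string moneyPattern = @"\d+((\.\d{2})?(k|M)|(,\d{3})*)";
string rangeConnectorPattern = @"\b\s*(to|-|and\b)\s*";
// Every piece is a verbatim string so that \s and \$ reach the regex engine as written.
string moneyRangePattern = @"(?<=^|\s)\$?" +
    moneyPattern + rangeConnectorPattern + @"\$?" + moneyPattern +
    @"(\s|$)";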

No need to write a parser.
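
If you also want to reject the misplaced-comma cases from the question, one possible variation is sketched below. It is only a sketch under my own assumptions, not a tested production pattern: the names `MoneyRangeDemo`, `Amount`, `Connector` and `FindRange` are made up for illustration, the `k`/`m` suffix is accepted in either case, and the lookarounds are my guess at what "properly formatted" should mean at the edges of a match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MoneyRangeDemo
{
    // One amount: optional "$", then either strictly comma-grouped digits
    // (1-3 digits followed by groups of exactly 3) or plain digits,
    // an optional decimal part, and an optional k/m suffix.
    private const string Amount = @"\$?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?[kKmM]?";

    // Range words taken from the question's examples.
    private const string Connector = @"\s*(?:-|\bto\b|\band\b)\s*";

    // The lookbehind keeps a match from starting in the middle of a longer token;
    // the lookahead keeps it from ending in the middle of one.
    private static readonly Regex Range = new Regex(
        @"(?<![\w$,.])" + Amount + Connector + Amount + @"(?!\w|[,.]\d)");

    // Returns the first matched range, or null when the input contains none.
    public static string FindRange(string input)
    {
        Match m = Range.Match(input);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "$10,000,000 to $20M",
            "$10k-$20k",
            "100.23k - 200.34k",
            "$20000 and $500600",
            "3456646 to 4230405",
            "$10,0000,000 to $20,000,000",
            "20k xyz 40k",
            "1985 xyz 1999",
        };

        foreach (string s in samples)
        {
            Console.WriteLine("{0} => {1}", s, FindRange(s) ?? "(no match)");
        }
    }
}
```

The idea is that a comma is only allowed as part of a complete group of three digits, and the lookarounds stop the engine from quietly matching a well-formed fragment of a badly formed number, which is what produced the surprising partial matches in the question.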

Lodewijk Bogaards