I wrote a regular expression for matching durations.

The regular expression is:

([0-9]+ (?:[y|Y]ears?|[y|Y]rs?|[m|M]o?nths?|[d|D]a?ys?) ?)+

You can check this on this regex tool.

Test cases that match

  1. This October I will complete 24 years. Right now I am 3 months short means 23 years 9 mnths 19 days.
  2. ATL is servering Research work from last 10 years 23 months 19 dys.

Test cases that should match, but do not

  1. I am twenty three years old.
  2. There was a disaster came exactly twenty two years twelve months thirty days back.

Doubts

  1. How can I detect numbers written out as English words? See the 3rd and 4th cases (the two unmatched ones above).

EDIT 1

I added a reFourDigits variable to handle cases such as "twelve hundred twenty", but it fails to match them. Please help me with this. All the details of the problem are below.

public static final String reDigit = "(?:[O|o]ne|[t|T]wo|[t|T]hree|[f|F]our|[f|F]ive|[s|S]ix|[s|S]even|[e|E]ight|[n|N]ine)";
public static final String reTeen = "(?:[t|T]wenty|[t|T]hirty|[f|F]orty|[f|F]ifty|[s|S]ixty|[s|S]eventy|[e|E]ighty|[n|N]inety)";
public static final String re10_19 = "(?:[t|T]en|[e|E]leven|[t|T]welve|[t|T]hirteen|[f|F]ourteen|[f|F]ifteen|[s|S]ixteen|[s|S]eventeen|[e|E]ighteen|[n|N]ineteen)";
public static final String reTwoDigits = "(?:(?:" + reTeen + "[- ])?" + reDigit + "|" + re10_19 + "|" + reTeen + ")";
public static final String reThreeDigits = "(?:(?:" + reDigit + " hundred (?:and)?)?" + reTwoDigits + "|" + reDigit + " hundred)";
public static final String reFourDigits = "(?:" + reTwoDigits + " hundred (?:and)? " + reTwoDigits + ")";
public static final String reSixDigits = "(?:(?:" + reThreeDigits + " thousand (?:and )?)?" + reThreeDigits + "|" + reThreeDigits + " thousand|" + reFourDigits + ")";
public static final String reTwelveDigits = "(?:(?:" + reSixDigits + " million (?:and )?)?" + reSixDigits + "|" + reSixDigits + " million)";

The pattern is:

String patternString = "\\b( ?(?:[,0-9]+|"+Constants.reTwelveDigits+") ?)\\b";

When I run it on `There are twenty hundred twenty two apples.`, it finds two strings, `twenty` and `twenty two`, instead of `twenty hundred twenty two`.

devsda
  • I think it'd be better if you can update the question with some more examples of what should be matched and what shouldn't. – Amal Murali Jun 17 '14 at 12:37
  • @Amal Cases 3 and 4, as explained under Doubts, are examples of what should match (i.e. twenty three years, twelve months, thirty days...) – llrs Jun 17 '14 at 12:39
  • `[y|Y]` etc. should be `[yY]`, `[mM]` etc. – hjpotter92 Jun 17 '14 at 12:42
  • Other than listing every possible number word to be matched, I don't see a way to accomplish this with a pure regular expression. Maybe it would make sense to programmatically replace occurrences of number words (with the corresponding number) before you use the regular expression? – Sascha Wolf Jun 17 '14 at 12:50
  • @Llopis Can you please explain what you wrote in the comment above? – devsda Jun 17 '14 at 13:33
  • @Llopis: please don't post answers as comments. If you think you have a solution, post it as answer where we can easily read it and give you feedback on it. This kind of half-baked solution spread out over a potentially endless series of replies-to-replies is precisely what Stack Overflow is trying to free us from. – Alan Moore Jun 23 '14 at 15:20
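
Sascha Wolf's suggestion above (turn number words into digits before applying the regex) could be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the class name `NumberWords` and the method `parse` are hypothetical, and the sketch assumes the input contains nothing but number words, optionally joined by "and", hyphens, or commas.

```java
import java.util.HashMap;
import java.util.Map;

public class NumberWords {
    // Hypothetical lookup table: word -> value for 0..19 and the tens.
    private static final Map<String, Integer> SMALL = new HashMap<>();
    static {
        String[] units = {"zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
            "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < units.length; i++) {
            SMALL.put(units[i], i);
        }
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty",
            "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) {
            SMALL.put(tens[i], (i + 2) * 10);
        }
    }

    // Converts a phrase such as "twenty hundred twenty two" to 2022.
    // Assumption: the phrase consists of number words only.
    static long parse(String words) {
        long total = 0;   // finished thousand/million groups
        long current = 0; // group currently being built
        for (String w : words.toLowerCase().trim().split("[\\s,-]+")) {
            if (w.isEmpty() || w.equals("and")) {
                continue;
            }
            Integer small = SMALL.get(w);
            if (small != null) {
                current += small;
            } else if (w.equals("hundred")) {
                // A bare "hundred" is treated as 100.
                current = (current == 0) ? 100 : current * 100;
            } else if (w.equals("thousand")) {
                total += current * 1000;
                current = 0;
            } else if (w.equals("million")) {
                total += current * 1_000_000;
                current = 0;
            } else {
                throw new IllegalArgumentException("Not a number word: " + w);
            }
        }
        return total + current;
    }

    public static void main(String[] args) {
        System.out.println(parse("twenty hundred twenty two"));
        System.out.println(parse("one million thirty two"));
    }
}
```

Once the words are replaced by digits, the original `[0-9]+` pattern can stay as it is. Finding where a number phrase starts and ends in free text is a separate problem that this sketch does not address.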

1 Answer

Personally, I would recommend a real parser. It is possible with a regex, but the pattern can become very lengthy. Below I used define from the PHP dialect of regex to avoid duplicating patterns. If the regex engine of your choice has no such construct, then you may need to expand every definition, which results in a pretty long pattern. You can still avoid having to write it out yourself by building up the pattern string dynamically with simple string concatenation.

(?(DEFINE)(?<Digit>one|two|three|four|five|six|seven|eight|nine))
(?(DEFINE)(?<Teen>twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety))
(?(DEFINE)(?<TwoDigits>((?&Teen)-)?(?&Digit)|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|(?&Teen)))
(?(DEFINE)(?<ThreeDigits>((?&Digit) hundred (and )?)?(?&TwoDigits)|(?&Digit) hundred))
(?(DEFINE)(?<SixDigits>((?&ThreeDigits) thousand (and )?)?(?&ThreeDigits)|(?&ThreeDigits) thousand))
(?(DEFINE)(?<TwelveDigits>((?&SixDigits) million (and )?)?(?&SixDigits)|(?&SixDigits) million))

Fiddle: http://regex101.com/r/oM4oF2

Prepend the definitions to your expression; then you can replace every [0-9]+ by (?:[0-9]+|(?&TwelveDigits)).

EDIT: As far as I can tell, Java has no reusable subpatterns, so you will have to fully expand the pattern.

String reDigit = "(?:one|two|three|four|five|six|seven|eight|nine)";
String reTeen = "(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)";
String reTwoDigits = "(?:(?:" + reTeen + "-)?" + reDigit + "|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|" + reTeen + ")";
String reThreeDigits = "(?:(?:" + reDigit + " hundred (?:and )?)?" + reTwoDigits + "|" + reDigit + " hundred)";
String reSixDigits = "(?:(?:" + reThreeDigits + " thousand (?:and )?)?" + reThreeDigits + "|" + reThreeDigits + " thousand)";
String reTwelveDigits = "(?:(?:" + reSixDigits + " million (?:and )?)?" + reSixDigits + "|" + reSixDigits + " million)";

String reNumeric = "\\b(?:[0-9]+|" + reTwelveDigits + ")\\b";

I could not find a Java fiddle site, so I used JavaScript instead, which has a similar regex engine: http://jsfiddle.net/f6RmN/
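
For completeness, here is a rough sketch of how the expanded pattern might be combined with the duration units from the question and run in Java. The class name `DurationFinder`, the method `findDurations`, and the constant names are made up for this illustration. Two things differ from the code above and are assumptions on my part: `[- ]` is used between the tens and the digit so that "twenty three" matches as well as "twenty-three", and the pattern is compiled with `Pattern.CASE_INSENSITIVE` instead of spelling out both cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DurationFinder {
    // Number-word building blocks, following the expanded definitions above.
    static final String DIGIT = "(?:one|two|three|four|five|six|seven|eight|nine)";
    static final String TENS = "(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)";
    // Assumption: accept a space or a hyphen between tens and digit.
    static final String TWO_DIGITS = "(?:(?:" + TENS + "[- ])?" + DIGIT
            + "|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|"
            + TENS + ")";
    static final String THREE_DIGITS = "(?:(?:" + DIGIT + " hundred (?:and )?)?" + TWO_DIGITS
            + "|" + DIGIT + " hundred)";
    static final String SIX_DIGITS = "(?:(?:" + THREE_DIGITS + " thousand (?:and )?)?"
            + THREE_DIGITS + "|" + THREE_DIGITS + " thousand)";
    static final String TWELVE_DIGITS = "(?:(?:" + SIX_DIGITS + " million (?:and )?)?"
            + SIX_DIGITS + "|" + SIX_DIGITS + " million)";

    // Either plain digits or a number written in words.
    static final String NUMBER = "(?:[0-9]+|" + TWELVE_DIGITS + ")";
    // Units from the question, with the character classes simplified.
    static final String UNIT = "(?:years?|yrs?|mo?nths?|da?ys?)";
    // One "<number> <unit>" part, e.g. "twelve months".
    static final String ONE_PART = NUMBER + " " + UNIT + "\\b";
    // One or more parts separated by single spaces.
    static final Pattern DURATION = Pattern.compile(
            "\\b" + ONE_PART + "(?: " + ONE_PART + ")*", Pattern.CASE_INSENSITIVE);

    // Returns every duration phrase found in the text, in order.
    static List<String> findDurations(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DURATION.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findDurations("I am twenty three years old."));
    }
}
```

Because the separating space sits inside the repeated group rather than at the end of each part, the matches carry no trailing whitespace.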

Ruud Helderman
  • @Ruud Please explain the logic as well. – devsda Jun 17 '14 at 14:52
  • @devnull: Please specify the regex/runtime engine you intend to use (JavaScript, Perl, PHP, .NET, etc). For an explanation, my answer has a hyperlink to a website explaining `define`, and the fiddle provides a highly detailed break-down of the entire expression. – Ruud Helderman Jun 17 '14 at 15:10
  • My project's language is Java. – devsda Jun 17 '14 at 15:16
  • Thanks a lot for your solution. In the first line you mention a parser: `Personally, I would recommend a real parser.` Can you please elaborate on this and give some links, so that I can write my first parser? – devsda Jun 17 '14 at 18:50
  • @devnull: I must admit I was surprised by the fact that a regex could get this far. I suppose the main benefit of a parser is its potential to _recognize_ the number (e.g. map thirty-two to 32). If you don't need that, then it may be overkill. Anyway, google on `java parser generator` and you will get a lot of promising links. None in particular that I can recommend, sorry. – Ruud Helderman Jun 17 '14 at 19:08
  • @Ruud Please see the updated part of the question. I have explained one doubt there. – devsda Jun 23 '14 at 12:29
  • @Ruud I am stuck in one place. Can we have a chat, if you don't mind? – devsda Jun 23 '14 at 14:11
  • @Ruud http://chat.stackoverflow.com/rooms/56130/regular-expression . I posted my doubt here. Please help me. I am totally stuck on this regular expression's strange behaviour. – devsda Jun 23 '14 at 15:22
  • @Ruud Please check this: http://jsfiddle.net/f6RmN/7/ I have written my changes over your code. – devsda Jun 23 '14 at 19:22
  • @Ruud The above code ( jsfiddle.net/f6RmN/8 ) fails on some cases. I tried different permutations but was unable to get the desired output. I put all of that in one fiddle. Please check: jsfiddle.net/f6RmN/14 – devsda Jun 24 '14 at 09:27
  • @devnull Sometimes, changing the order of subpatterns helps (i.e. `A|B` may behave differently from `B|A`); I put reFourDigits in front inside reSixDigits. Hope this will not spawn other quirks. http://jsfiddle.net/f6RmN/15/ – Ruud Helderman Jun 24 '14 at 09:43
  • @Ruud OK. But now it misses capitalized words. Please see this fiddle: http://jsfiddle.net/f6RmN/16/ (It fails to read Twelve and Ninety, but works for ninety and twelve.) – devsda Jun 24 '14 at 09:51
  • @Ruud For the time being I convert the input string to lowercase. But can you tell me why this ambiguity occurs in the regular expression? – devsda Jun 24 '14 at 10:21
  • @devnull "Ninety" does not match because my pattern is `ninety`. `[Nn]inety` would work, but it looks cluttered; why not make the entire regex case-insensitive? `new RegExp(reNumericWithoutUnit, "gi");` About the order of subpatterns, I found this Q&A: http://stackoverflow.com/questions/10248776/why-order-matters-in-this-regex-with-alternation – Ruud Helderman Jun 24 '14 at 11:00
  • @Ruud When I give the input `one million, one million thirty two`, it reads only the first number. I swapped the positions in the `reSixDigits` variable, but it doesn't make any difference. Check this link: http://jsfiddle.net/f6RmN/17/ – devsda Jun 26 '14 at 12:22
  • @Ruud OK, I will wait for your reply. – devsda Jun 27 '14 at 05:13
  • @Ruud Please check this link http://jsfiddle.net/f6RmN/19/ . It contains all the failure examples. – devsda Jun 27 '14 at 06:25
  • @devnull: I refactored reFourDigits and some of the other patterns; seems to work better now: http://jsfiddle.net/f6RmN/20/ . BTW, most humans don't call 3000 "thirty hundred", but I could agree it is overkill to exclude it. – Ruud Helderman Jun 27 '14 at 16:12
  • @Ruud I need your help again. I want to match times. I wrote a regex for that, but it captures extra spaces, which I don't understand. Please visit this link: http://jsfiddle.net/f6RmN/21/ . Help me. – devsda Jul 30 '14 at 13:41
  • @devnull: What extra spaces? Your fiddle finds two matches; based on a quick look at the regex, that's more or less what I'd expect. – Ruud Helderman Jul 30 '14 at 21:00
  • No, in the time unit (am/pm), both matches end with a space, and that causes problems in my further computation. Please tell me how to remove any extra trailing spaces. – devsda Jul 31 '14 at 03:06
  • Sorry, `.text('[' + result[0] + ']'))` clearly shows there's no leading or trailing space in either match. The only space captured is the one between the `4` and the `p`. Tested in Chrome 36. – Ruud Helderman Jul 31 '14 at 12:34
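
The point about subpattern order raised in the comments (`A|B` versus `B|A`) can be reproduced in isolation. The sketch below is only an illustration with simplified patterns, not the code from the fiddles; the class name `AlternationOrder` and the helper `firstMatch` are hypothetical. It also shows a second effect that, as far as I can tell, applies to the `reFourDigits` definition in the question: ` hundred (?:and)? ` requires two consecutive spaces when "and" is absent, whereas ` hundred (?:and )?` does not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrder {
    // Returns the first match of the regex in the text, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "There are twenty hundred twenty apples.";

        // Shorter alternative first: it succeeds, so the longer one is never tried.
        System.out.println(firstMatch("\\b(?:twenty|twenty hundred twenty)\\b", text));

        // Longer alternative first: the whole phrase is matched.
        System.out.println(firstMatch("\\b(?:twenty hundred twenty|twenty)\\b", text));

        // The space after the optional group is mandatory, so without "and"
        // the input would need two spaces after "hundred".
        System.out.println(firstMatch("twenty hundred (?:and)? twenty", text));

        // Moving the space inside the optional group avoids that.
        System.out.println(firstMatch("twenty hundred (?:and )?twenty", text));
    }
}
```

Java's regex engine tries alternatives from left to right and keeps the first one that lets the overall match succeed, which is why putting the longer alternative first changes the result.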