2

I'm trying to identify whether a paragraph contains date information. Here are the issues:

  1. We don't know where the date string might appear. A paragraph might read something like "We would like to set the appointment at Nov. 15th. Then we would .....". So we cannot directly use DateTime.parse().

  2. The format of the date is arbitrary: it can be a more formal form like "Nov. 15th" or "08/21/1988", or something looser like "5th in this month".

Since date information can take so many forms, covering every case is unlikely; I just want to cover as many cases as possible. The most lightweight solution I can come up with is regular expressions, but that would be a huge expression. Does anyone know of better solutions, or of existing regular expressions for this?

(P.S. I would prefer a lightweight approach; methods like machine learning might be more general but are not applicable to my task here.)

faz

2 Answers

2

I'd probably approach it with a regular expression (or several) as well.

I'd make the regular expression match regions that look date-like: ordinal suffixes such as "th", "nd", "rd" and "st", month and day names and their abbreviations, numbers separated by dots, dashes, slashes or colons, and similar things. Experiment with that and see how well it finds dates across a ton of test cases.
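A minimal sketch of that idea in Python (the language is my assumption, since the thread doesn't settle on one). The names `MONTH`, `ORDINAL`, `DATE_LIKE` and `find_date_like` are made up for illustration, and the pattern only covers the three example shapes from the question, so treat it as a starting point rather than a complete date matcher:

```python
import re

# Month names and common abbreviations (English only; an assumption).
MONTH = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)
# Day number with an ordinal suffix, e.g. "15th", "2nd".
ORDINAL = r"\d{1,2}(?:st|nd|rd|th)"

DATE_LIKE = re.compile(
    r"\b(?:"
    rf"{MONTH}\.?\s+(?:{ORDINAL}|\d{{1,2}})(?:,?\s+\d{{4}})?"  # "Nov. 15th", "November 15, 2015"
    r"|\d{1,2}[./-]\d{1,2}[./-]\d{2,4}"                        # "08/21/1988", "21.08.88"
    rf"|{ORDINAL}"                                             # bare "5th"
    r")\b",
    re.IGNORECASE,
)

def find_date_like(text):
    """Return every substring of text that looks like a date."""
    return [m.group(0) for m in DATE_LIKE.finditer(text)]

print(find_date_like("We would like to set the appointment at Nov. 15th. Then we would"))
```

This only locates candidate regions; a bare ordinal like "5th" will also match in non-date contexts ("the 5th floor"), which is why testing against many sample paragraphs matters.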

Parsing the possible dates is another story. I guess you'd need something as powerful as PHP's strtotime.

Another approach is to clearly define a big collection of possible formats. Then, when one is detected, you can easily parse it. That feels too brute-force to me, though.
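For what it's worth, the "collection of formats" approach could look roughly like this in Python, using the standard `datetime.strptime`. The `FORMATS` list and the `parse_candidate` helper are hypothetical and deliberately short; a real list would be much longer, and the month-name formats assume an English locale:

```python
from datetime import datetime

# Hypothetical, deliberately small list of accepted formats (tried in order).
FORMATS = [
    "%m/%d/%Y",   # 08/21/1988 (US month-first order is an assumption)
    "%Y-%m-%d",   # 1988-08-21
    "%b %d %Y",   # Nov 15 2015
    "%B %d %Y",   # November 15 2015
    "%d %B %Y",   # 15 November 2015
]

def parse_candidate(text):
    """Try each known format in turn; return a date, or None if none fits."""
    # Drop commas and periods so "Nov. 15, 2015" becomes "Nov 15 2015".
    cleaned = " ".join(text.replace(",", " ").replace(".", " ").split())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None

print(parse_candidate("Nov. 15, 2015"))
```

It is brute force, as noted, but each format is unambiguous once it matches, which keeps the parsing step trivial.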

Felk
  • Thanks, I guess I will start with the suggestions first and try to see how much I can cover.... – faz Apr 17 '15 at 21:54
1

As a starting point, there are seven pages of date regexes over at http://regexlib.com. If you don't know which one you're looking for, I would create an array and apply them one at a time. You'll still have a problem with dates like 11/12/2015 vs. 12/11/2015 so some kind of process for clarification is still necessary (e.g., automatically mail back and ask "Do you mean December 11 or November 12?").
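To detect when that clarification step is needed, one option is to try both readings of a numeric date and see whether more than one is valid. This is only a sketch in Python; `NUMERIC` and `interpretations` are names I made up, and it handles just the slash-separated form with a four-digit year:

```python
import re
from datetime import date

# Slash-separated numeric date with a four-digit year, e.g. 11/12/2015.
NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def interpretations(text):
    """Return every valid reading (month/day and day/month) of the first numeric date."""
    m = NUMERIC.search(text)
    if not m:
        return []
    a, b, year = map(int, m.groups())
    results = set()
    for month, day in ((a, b), (b, a)):
        try:
            results.add(date(year, month, day))
        except ValueError:
            pass  # e.g. month 21 does not exist, so this reading is discarded
    return sorted(results)

readings = interpretations("Let's meet on 11/12/2015.")
if len(readings) > 1:
    print("Ambiguous, ask the sender:", readings)
```

A date like 08/21/1988 yields a single reading because 21 cannot be a month, so only inputs where both numbers are 12 or less (and differ) need a follow-up question.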

Edward Doolittle