I'm trying to pull as many dates (YMD, YM, or even just Y) out of a very large text sample in an SQL database. All of the text in the database is in plain English and contains large numbers as well. What I'm looking to do is find as many of these dates as possible, recognize them as dates, and pull out what date they represent as well as the string that was found. Is there a way to go about this other than thinking up every possible regular expression I can to cover cases like 1/23/1985 while also covering cases like "The 1980s" and "1962 marked the beginning". Is there an example out there of a project that already took this on? Thanks!
2 Answers
It seems do you need a mix of:
1) Take substrings from a date with regular expressions with substring postgresql function:
substring(string from pattern)
substring('foobar' from '%#"o_b#"%' for '#') oob
2) Work on all possible date formats. Regular Expression to match valid dates can be a start point.
Perhaps you can write your own user defined function to make this translation. Don't forget to share it with us ;)
* EDITED *
If this is a one time job. You can export your date and use python parsedatetime library to parse dates, then come back to database.
parsedatetime is able to parse, for example, the following: August 25th, 2008 25 Aug 2008 Aug 25 5pm 5pm August 25 next saturday tomorrow next thursday at 4pm at 4pm eod tomorrow eod eod tuesday eoy eom in 5 minutes 5 minutes from now 5 hours before now 2 hours before noon 2 days from tomorrow

- 1
- 1

- 48,760
- 8
- 117
- 177
-
I will definitely share it when it comes time. I'm thinking I'll end up writing it in a higher level language though once I figure out how to do it. I've contacted some people doing NLP research about how to do this. Will post back! – jrbalsano Aug 05 '12 at 15:06
It depends on the future complexity of your problem (e.g., more formats and other possibilities to cover), but I think you can check some Temporal Taggers which in general should handle most cases that you think. You can start with Stanford SUTime which actually uses rules as well.

- 2,862
- 1
- 27
- 37