heuristic (fuzzy) date extraction from the string?

Question

I need to heuristically parse a string of text that contains a date in an arbitrary (unknown) format.

function parseDateStr($text) {
    $cleanText = filter($text);
    # ...
    $day = findDay($cleanText);
    $month = findMonth($cleanText);
    $year = findYear($cleanText);
    # .. assert constraints, parse again or fail
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}

The input text is an English sentence plus arbitrary punctuation symbols (roughly a subset of the \W regexp class). The task of the algorithm is to extract the date after filtering away any noise words unrelated to it. The algorithm is allowed to fail and return no result. If only a pair of adjacent digits (MM) together with four other digits (YYYY) is found in the string, the two digits are assumed to be the month and the day is taken to be 01 (the first day of the month). The result is a date in "YYYY-MM-DD" (SQL) format (of type DATE).

My idea is to design a series of filters using preg_replace & co., then apply logical constraints on the range of $year and $day, use a vocabulary for $month, etc. I would not be surprised, though, if similar but more elegant solutions or approaches are conceivable or already exist. If so, please let me know about them. I would also appreciate any criticism or pointers to potential pitfalls.
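A minimal sketch of the filter-and-constrain idea, for illustration only. The function name extractDateHeuristic, the assumed year range 1000-2099, and the rule "without a month name, the first short number is the month and the second is the day" are all assumptions made for this example, not part of any library. It also shows one pitfall: a word such as "may" will be read as a month.

```php
<?php
/**
 * Hypothetical helper: extract a date from free text using a year range
 * constraint and a month-name vocabulary.
 * Returns 'YYYY-MM-DD', or null when nothing plausible is found.
 */
function extractDateHeuristic(string $text): ?string
{
    // Assumed plausible year range: 1000-2099.
    $yearPattern = '/\b(1\d{3}|20\d{2})\b/';
    if (!preg_match($yearPattern, $text, $m)) {
        return null;
    }
    $year = (int) $m[1];
    // Drop the first year match so its digits are not reused below.
    $rest = preg_replace($yearPattern, ' ', $text, 1);

    // Month vocabulary: English names and their three-letter abbreviations.
    $names = ['jan' => 1, 'feb' => 2, 'mar' => 3, 'apr' => 4, 'may' => 5, 'jun' => 6,
              'jul' => 7, 'aug' => 8, 'sep' => 9, 'oct' => 10, 'nov' => 11, 'dec' => 12];
    $monthPattern = '/\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?'
        . '|jul(?:y)?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b/i';

    // Standalone one- or two-digit numbers are the day/month candidates.
    preg_match_all('/\b\d{1,2}\b/', $rest, $n);
    $numbers = array_map('intval', $n[0]);

    if (preg_match($monthPattern, $rest, $m)) {
        // Month given by name: the first short number, if any, is the day.
        $month = $names[strtolower(substr($m[1], 0, 3))];
        $day = $numbers[0] ?? 1;
    } elseif ($numbers !== []) {
        // Assumption: first number is the month, second (if present) the day.
        $month = $numbers[0];
        $day = $numbers[1] ?? 1;
    } else {
        return null;
    }

    // checkdate() enforces month 1-12 and a valid day for that month and year.
    if (!checkdate($month, $day, $year)) {
        return null;
    }
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}
```

Calling extractDateHeuristic('issue 07 of 2010') applies the MM plus YYYY rule described above and defaults the day to 01, while input with no four-digit year returns null.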

Relation to similar questions:

Please note that this question differs from more basic date parsing questions such as:

since in my case I cannot specify or determine the format of the string. On the other hand, the following questions discuss similar tasks:

I am not sure whether the last one is a duplicate; it is not entirely clear to me what the OP wants to parse (although checkdate and date_parse seem to be partially useful). But the first question's point about the whole "monkey business" also applies to my case, and it has been addressed by fuzzy parsing, as in

dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)

Finally, the second one contains a great date-grabbing regexp (almost "fuzzy").

PS: by elegant I mean that the code is rather compact (there are no significant limitations on performance, so using "hacky" regexps is OK).

As mentioned in one of your links, how do you parse 1/2/3? I think examples of the strings you need parsed could prove helpful, or is it like user input and completely random? Lastly I'd argue the main argument against hacky regexes are usually not performance (unless run *many* times against large strings) but code maintenance and proneness to bugs. — kjetilh, Mar 11 '13 at 23:15
@kjetilh points taken. I'll provide a list of example inputs ASAP and also some solution code from my part. — Yauhen Yakimovich, Mar 11 '13 at 23:18
And yes, **var_dump(date_parse("Joe Soap was born on 12 February 1981"));** seems to do already a very good job. — Yauhen Yakimovich, Mar 11 '13 at 23:23

score 5 · Accepted Answer · answered Mar 12 '13 at 02:17

timelib

Well, date_parse performs very well, and it was very educational to learn why. The PHP function date_parse is part of ext/date/lib, or timelib, and apparently (despite the lack of proper documentation) its implementation in C (written by Derick Rethans and called from the Zend Engine macros part with declarations) makes it a clever tool:

date_parse is already fuzzy: there are a lot of warnings (and complaints) on the documentation page that the function tolerates and parses too much, but that is evidently a feature and not a bug (otherwise one should use date_parse_from_format or, respectively, DateTime::createFromFormat()).
date_parse uses (a lot of) regular expressions in a relatively smart way (based on re2c).
In addition to filtering, this "scanner" looks for all possible combinations of words and date formats (from the list of known months and timezones) and, finally, just makes a "blind" guess by looking for YYYY, MM and DD "separately" (very similar to what I need to do).
date_parse is a true compiled "scanner" that comes with look-ahead logic and error reporting that can be handled further by the user (no exceptions, just messages inside the nested array of results).
There is even a Python package wrapping the C code of timelib (so I am not even sure which is ultimately better at "parsing the monkey business": timelib or python-dateutil).
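To illustrate the error reporting mentioned above, here is a small sketch that reads the array returned by date_parse. The helper name describeParse and the shape of its return value are assumptions for this example; the keys read from the date_parse result (year, month, day, warning_count, error_count, errors) are the documented ones. Fields that were not found come back as false rather than raising an exception.

```php
<?php
/**
 * Hypothetical helper: summarize what date_parse() recognized in $text.
 */
function describeParse(string $text): array
{
    $parsed = date_parse($text);

    return [
        // Each date field is an int when found and false otherwise.
        'year'     => $parsed['year'],
        'month'    => $parsed['month'],
        'day'      => $parsed['day'],
        // Counts of problems the scanner reported; nothing is thrown.
        'warnings' => $parsed['warning_count'],
        'errors'   => $parsed['error_count'],
        // Human-readable error messages (keyed by position in the original array).
        'messages' => array_values($parsed['errors']),
    ];
}
```

A caller can then decide how strict to be, for example accepting a result whenever year and month are not false even if the error count is non-zero.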

testing and examples

For my part, I have not found any input in my dataset that date_parse could not parse, e.g.:

echo FuzzyDateParser::fromText('banana 1/2/3');
echo FuzzyDateParser::fromText('Joe Soap was born on 12 February 1981');
echo FuzzyDateParser::fromText('2005 Feb., reprint');
echo FuzzyDateParser::fromText('!'); # will fail to parse, producing an empty string.
echo FuzzyDateParser::fromText('monkey 2010-07-10 loves bananas and php');

The code for the FuzzyDateParser class can be found in this gist. It can be useful as a template for handling errors and implementing a fallback from date_parse results to your own custom logic (which I eventually did not have to do for my case).
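For readers without access to the gist, the following is a hypothetical stand-in and not the gist's actual code. It assumes that a result is usable when date_parse found at least a year and a month, that a missing day defaults to 1, and that an optional $fallback callable (an invented parameter) supplies the custom logic.

```php
<?php
/**
 * Hypothetical stand-in for the class used above; not the code from the gist.
 */
final class FuzzyDateParser
{
    /**
     * Returns 'YYYY-MM-DD', or an empty string when no date could be found.
     *
     * @param callable|null $fallback optional custom parser: (string) -> ?string
     */
    public static function fromText(string $text, ?callable $fallback = null): string
    {
        $parsed = date_parse($text);

        // date_parse() reports fields it could not find as false.
        if ($parsed['year'] !== false && $parsed['month'] !== false) {
            // Default to the first day of the month when no day was found.
            $day = $parsed['day'] !== false ? $parsed['day'] : 1;
            if (checkdate($parsed['month'], $day, $parsed['year'])) {
                return sprintf('%04d-%02d-%02d', $parsed['year'], $parsed['month'], $day);
            }
        }

        // Hand over to custom logic when date_parse() found no usable date.
        $result = $fallback !== null ? $fallback($text) : null;

        return $result ?? '';
    }
}
```

Errors reported by date_parse are deliberately ignored here as long as the date fields pass checkdate; a stricter variant could also inspect error_count before accepting the result.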
