I need to heuristically parse a string of text that contains a date in a rather arbitrary (unknown) format.
function parseDateStr($text) {
    $cleanText = filter($text);
    # ...
    $day = findDay($cleanText);
    $month = findMonth($cleanText);
    $year = findYear($cleanText);
    # .. assert constraints, parse again or fail
    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}
The input text is an English sentence plus arbitrary syntax symbols (e.g. a subset of the \W regexp class). The task of the algorithm is to extract only the date, after filtering out any garbage (noise) words unrelated to it. The algorithm is allowed to fail and return no result. If only a pair of adjacent digits (MM) and a group of four digits (YYYY) are found in the string, the two digits are assumed to be the month and the day is taken to be 01 (the first day of the month). The result is a date in "YYYY-MM-DD" (SQL DATE) format.
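To make the MM/YYYY fallback rule concrete, here is a minimal sketch of how I imagine it. The function name parseYearMonth is hypothetical, and the rule "exactly one two-digit group and exactly one four-digit group" is my own reading of the requirement, not a settled design.

```php
<?php
// Hypothetical helper: returns "YYYY-MM-01" when the text contains exactly one
// four-digit group (taken as the year) and exactly one two-digit group (taken
// as the month); returns null in every other case.
function parseYearMonth(string $text): ?string
{
    // Collapse every run of non-digits into a single space so that only
    // whitespace-separated digit groups remain.
    $digitsOnly = preg_replace('/\D+/', ' ', $text);

    // Count the standalone four-digit and two-digit groups.
    preg_match_all('/\b\d{4}\b/', $digitsOnly, $years);
    preg_match_all('/\b\d{2}\b/', $digitsOnly, $months);
    if (count($years[0]) !== 1 || count($months[0]) !== 1) {
        return null; // ambiguous or incomplete input: give up
    }

    $year  = (int) $years[0][0];
    $month = (int) $months[0][0];
    if ($month < 1 || $month > 12) {
        return null; // the two digits cannot be a month
    }

    // The day defaults to the first day of the month.
    return sprintf('%04d-%02d-01', $year, $month);
}
```

With this sketch, a string such as 'published 07/2010, see below' gives '2010-07-01', while '10.07.2010' is rejected because it contains two two-digit groups and therefore needs the full day/month/year logic.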
My idea is to design a series of filters using preg_replace & co., then apply logical constraints on the ranges of $year and $day, use a vocabulary for $month, etc. However, I would not be surprised if similar but more elegant solutions or approaches already exist. If so, please let me know about them. I would also appreciate any criticism or pointers to potential pitfalls.
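A rough sketch of the vocabulary-plus-constraints idea is given here. The names monthVocabulary and parseNamedDate, the default year range 1900 to 2100, and the "first small number is the day" rule are all assumptions made for illustration; only preg_match, checkdate and sprintf are standard PHP functions.

```php
<?php
// Hypothetical vocabulary: full English month names and their three-letter
// abbreviations, each mapped to the month number.
function monthVocabulary(): array
{
    $names = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
              'august', 'september', 'october', 'november', 'december'];
    $vocab = [];
    foreach ($names as $i => $name) {
        $vocab[$name] = $i + 1;
        $vocab[substr($name, 0, 3)] = $i + 1;
    }
    return $vocab;
}

// Hypothetical parser: month from the vocabulary, year from a range-checked
// four-digit group, day from the first standalone 1-2 digit number (default 1).
function parseNamedDate(string $text, int $minYear = 1900, int $maxYear = 2100): ?string
{
    $vocab = monthVocabulary();
    $pattern = '/\b(' . implode('|', array_keys($vocab)) . ')\b/i';
    if (!preg_match($pattern, $text, $m)) {
        return null; // no month word found
    }
    $month = $vocab[strtolower($m[1])];

    if (!preg_match('/\b(\d{4})\b/', $text, $y)) {
        return null; // no four-digit year found
    }
    $year = (int) $y[1];
    if ($year < $minYear || $year > $maxYear) {
        return null; // year outside the assumed plausible range
    }

    // Optional ordinal suffix, e.g. "10th" or "1st".
    $day = 1;
    if (preg_match('/\b(\d{1,2})(?:st|nd|rd|th)?\b/i', $text, $d)) {
        $day = (int) $d[1];
    }

    // checkdate() rejects impossible combinations such as 31 February.
    return checkdate($month, $day, $year)
        ? sprintf('%04d-%02d-%02d', $year, $month, $day)
        : null;
}
```

One pitfall is already visible in this sketch: the word "may" is matched as a month even when it is used as a verb, so a stop-word filter or a position-based check would probably be needed on top of it.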
Relation to similar questions:
Please note that this question differs from more basic date parsing questions such as:
since in my case I cannot specify or determine the format of the string. On the other hand, the following questions deal with similar tasks:
- Extracting date from a string in Python
- Extract multiple date format from few string variables in php
- Extracting date from a string in PHP
I am not sure whether the last one is a duplicate; it is not entirely clear to me what the OP wants to parse (although checkdate and date_parse seem to be partially useful). The remark in the first question about the whole "monkey business" also applies to my case, and it has been addressed there by fuzzy parsing, as in
dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)
Finally, the second one contains a great regexp for grabbing dates (almost "fuzzy").
PS: by "elegant" I mean that the code is rather compact (there are no significant performance constraints, so using "hacky" regexps is OK).