1

I'm trying to create a date extractor from a string to be a catch all on YouTube videos for concerts. Many video titles are formatted as such:

PHISH Reba Comcast Center Hartford CT. 6/18/2010
Phish - It's Ice - November 30, 1991
PHISH - 11.30.91 I didn't know
Phish/Worcester,MA 12-31-91 Llama
Phish: Tube / Runaway Jim [HD] 2011-01-01 - New York, NY
Phish - Stash (Live) 12.29.93

Those are just a few examples. Basically, dates can be anything from MM-DD-YYYY to MM-DD-YY to YY-MM-DD, etc. Each MM and DD can be one or two digits, and each year can be two or four digits. The separator varies between a period, a dash, and a slash, which can be handled by a simple /.?/ in the regex.

I began by stripping the whitespace and then running this simple Regex on the strings:

str.replace((new RegExp(' ', 'g')), '').match(/(([0-9]{1,4}).?([0-9]{1,2}).?([0-9]{1,4}))/)

// to highlight the regex:
// (([0-9]{1,4}).?([0-9]{1,2}).?([0-9]{1,4}))

This seems to work pretty well, but I also have to include the logic around which number is the year, which is the month, day, etc. along with detecting false positives.

Also, while I don't expect to be able to detect "November 2" as 11/2, that would be cool :)

Can anyone push me forward a bit or suggest any solutions? I don't want to use a library... I'd rather write specific code to this as it's not terribly complicated. Thanks

Here's a testing environment (open your console to see results) so you can play with the data easily. http://jsfiddle.net/ZNLxW/4/

switz
  • How would you know if a date was meant as DDMMYYYY or MMDDYYYY? e.g. What does 2/10/2012 mean? – Lee Taylor Sep 02 '12 at 21:05
  • Exactly. It doesn't have to be perfect, but I would love to get a best guess. 99% of the time, it will be MMDD, not DDMM, so we should suggest to the user MMDD. Also, there's some logic you can use, i.e. if it's 30121999, then it has to be December 30th (there's no month > 12). – switz Sep 02 '12 at 21:07
  • It sounds like you know approximately how you want the logic to work, so I'd write a bunch of test cases with a testing framework (or just a script using the `assert` module) and then knock out the instances one by one. – Michelle Tilley Sep 02 '12 at 21:15
  • Yeah, but I'm not entirely sure how to organize the function or the best way to accomplish each task in javascript. Should I just be using regex matches and then a bunch of if statements for each case? If someone is on IRC, I'd love to talk through it. switz on freenode. – switz Sep 02 '12 at 21:16
  • Is `101112` November 10, 2012 or October 11, 2012 or November 12, 2010 or December 11, 2010 (or October 12, 2011 or December 10, 2011)? – some Sep 02 '12 at 21:22
  • well, usually there will be terminators, so we will get something like 10.11.12. I would consider that to be read as directly as possible, which would equate to October 11, 2012. No one (US target demographic) would write that and expect anyone to understand it any other way. If they wrote 2011.11.12 however, we could take that as November 12, 2011. Once again, this is just a suggestion to users and does not have to be 110% accurate. Best guess is all we need. – switz Sep 02 '12 at 21:25
  • I'm testing with [regexpal](http://www.regexpal.com/) with `(\d{1,4}[/.-]\d{1,2}[/.-]\d{1,4})|(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\s+\d{1,2}[\s,]+\d{2}(\d{2})?` – some Sep 02 '12 at 21:37
  • @some that's completely unreadable, I would not create a single regexp to catch all. Use multiple instead, one per variant. – Maarten Bodewes Sep 02 '12 at 23:40
  • May I suggest some way of weighing the different options? One for the entire date, one for the component (day, month, year). You can then tweak the weights for specific options, and create cut-off weight (anything below is not a date). – Maarten Bodewes Sep 02 '12 at 23:44
  • @owlstead Regexps have a tendency to be hard to read. If you think that was unreadable, look at [this](http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html) (and yes, I 'decoded' that one when I was learning regexps). You can look at the regexp in my answer below. I usually put all the parts in an array of strings that I join at the end, so that I can add line breaks and comments if I like. It makes it more readable. – some Sep 03 '12 at 01:17
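
The best-guess rule discussed in the comments above (prefer MM-DD for a US audience, but fall back to DD-MM when the first field cannot be a month) could be sketched roughly like this. `guessMonthDay` is a hypothetical helper name used only for illustration, and the month-first preference is the assumption stated in the comments, not a general rule.

```javascript
// Hypothetical helper: guess the month/day order of two numeric fields.
// Assumption: month-first (US style) wins whenever both orders are possible.
function guessMonthDay(a, b) {
  var first = Number(a);
  var second = Number(b);
  var isMonth = function (n) { return n >= 1 && n <= 12; };
  var isDay = function (n) { return n >= 1 && n <= 31; };

  if (isMonth(first) && isDay(second)) {
    return { month: first, day: second }; // MM-DD (preferred)
  }
  if (isDay(first) && isMonth(second)) {
    return { month: second, day: first }; // DD-MM (first field cannot be a month)
  }
  return null; // neither order gives a plausible month and day
}
```

With this sketch, `guessMonthDay('2', '10')` is read as February 10, while `guessMonthDay('30', '12')` can only be December 30.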

2 Answers

1

Here is my attempt (with jsfiddle)

function formatDate(date) {
  function lz(value,w) {
    value = value.toString();
    return "0000".slice(4-w + value.length) + value;
  }
  return [
    lz(date.getFullYear(),4),
    lz(date.getMonth()+1,2),
    lz(date.getDate(),2)
  ].join('-');
}

//RegExp with support for short (XdYdZ) and long (month_in_text day, year)
var reDate = new RegExp([
  '(?:', // Short format
    '\\b',
    '(\\d{4}|\\d{1,2})',    // field.short_value_1
    '\\s*([./-])\\s*',      // field.short_del_1
    '(\\d{1,2})',           // field.short_value_2
    '\\s*([./-])\\s*',      // field.short_del_2
    '(\\d{4}|\\d{1,2})',    // field.short_value_3
    '\\b',
  ')|(?:', // Long format
    '\\b',
    '(',                    // field.long_month
    'jan(?:uary)?|',
    'feb(?:ruary)?|',
    'mar(?:ch)?|',
    'apr(?:il)?|',
    'may|',
    'jun(?:e)?|',
    'jul(?:y)?|',
    'aug(?:ust)?|',
    'sep(?:tember)?|',
    'oct(?:ober)?|',
    'nov(?:ember)?|',
    'dec(?:ember)?',
    ')',
    '\\s+',                 // required space
    '(\\d{1,2})\\b',        // field.long_date
    '\\s*',                 // optional space
    ',?',                   // optional delimiter
    '\\s*',                 // optional space
    '(\\d{4}|\\d{2})\\b',   // field.long_year
  ')'
].join(''),'i');


//Month names, must be 3 chars lower case.
//Used to convert month name to number.
var  monthNames = [
  'jan', 'feb', 'mar', 'apr', 'may', 'jun',
  'jul', 'aug', 'sep', 'oct', 'nov', 'dec'
];


var extractDateFromString = function(str) {
    var m = str.match(reDate);
    var date;
    if (m) {
      var idx=-1;
      //Convert array from regexp result to named variables.
      //Makes it so much easier to change the regexp without
      //changing the rest of the code.
      var field = {
        all : m[++idx],
        short_value_1 : m[++idx],
        short_del_1 : m[++idx],
        short_value_2 : m[++idx],
        short_del_2 : m[++idx],
        short_value_3 : m[++idx],
        long_month : m[++idx],
        long_date : m[++idx],
        long_year : m[++idx]
      }

      //If field.long_month is set, it is a date formatted with a named month
      if (field.long_month) {
        var month = monthNames.indexOf(
            field.long_month.slice(0,3).toLowerCase()
        );
        // TODO: Add test for sane year
        // TODO: Add test for sane month
        // TODO: Add test for sane date
        date = new Date(field.long_year,month,field.long_date);
      } else {
        // Short format: value_1 del_1 value_2 del_2 value_3
        var year, month, day;

        if (field.short_del_1 != field.short_del_2) {
          if (
            field.short_del_1 === '/' &&
            field.short_del_2 === '-'
          ) {
            // DD/MM-YYYY
            year = field.short_value_3;
            month = field.short_value_2;
            day = field.short_value_1;
            console.log('DMY',field.all,+year,+month,+day);
          } else {
            // TODO: Add other formats here.
            // If delimiters don't match it isn't a sane date.
            console.log('different delimiters');
          }
        } else {
          // assume YMD if
          //   (delimiter is '-' or value_1 > 31)
          //   and value_3 < 32
          if (
            (field.short_del_1 == '-' || field.short_value_1 > 31)
            && (field.short_value_3 < 32)
           ) {
            // YMD
            year = field.short_value_1;
            month = field.short_value_2;
            day = field.short_value_3;
            console.log('YMD',field.all,+year,+month,+day);
          } else {
            // MDY
            year = field.short_value_3;
            month = field.short_value_1;
            day = field.short_value_2;
            console.log('MDY',field.all,+year,+month,+day);
          }
        }

        if (year !== undefined) {
          year = +year; //convert to number
          //Handle years without a century 
          //year 00-49 = 2000-2049, 50-99 = 1950-1999
          if ( year < 100) {
            year += year < 50 ? 2000:1900;
          }
          date = new Date(year,+month-1,+day);
        }
      } 
    }

    var div = document.createElement('div');
    div.className = date ? 'pass' : 'fail';
    div.appendChild(document.createTextNode(date?formatDate(date):'NaD'));
    div.appendChild(document.createTextNode(' ' + str));
    document.body.appendChild(div);    
}
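// `dates` is assumed to be the test data array from the jsfiddle;
// each entry has a `name` property holding the video title.
for (var i = 0; i < dates.length; i++) {
    extractDateFromString(dates[i].name);
}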

Updated: Tweaked the regexp for long format (added \b)

Updated: Tweaked the regexp again. No more 3 digit fields. (either 1, 2 or 4)
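
The sanity checks marked TODO in the code above could be sketched along these lines. This is only an illustration: `makeSaneDate` is a hypothetical helper that is not part of the code above, and it assumes numeric arguments, a 1-based month, and a full four-digit year (so two-digit years must be expanded first).

```javascript
// Hypothetical helper: returns a Date only if year/month/day form a real
// calendar date, otherwise null. `month` is 1-based (1 = January).
function makeSaneDate(year, month, day) {
  var date = new Date(year, month - 1, day);
  // The Date constructor silently rolls over out-of-range values
  // (for example, February 30 becomes a day in March), so read the
  // parts back and compare them with what was passed in.
  if (
    date.getFullYear() !== year ||
    date.getMonth() !== month - 1 ||
    date.getDate() !== day
  ) {
    return null;
  }
  return date;
}
```

A caller would then treat a `null` result as "not a date" instead of displaying a rolled-over value.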

some
  • This is great! I've gone ahead and modified a bit for my own needs. One of the best SO answers that I've ever received. Thank you! – switz Sep 03 '12 at 23:43
  • Thank you @Switz! I'm happy to help. By the way, I found your question about `switch` statements and added an [answer here](http://stackoverflow.com/a/12259830/36866), because the highest-voted solution is about 30 times slower in Chrome than the fastest tested solution. You might want to check it out. – some Sep 04 '12 at 08:56
-1

JavaScript already has a Date object:

http://www.w3schools.com/jsref/jsref_obj_date.asp

Try this:

var d = new Date(dateString);
Kulik
  • Yes, it has, but it is very limited in what it parses. For example `01.02.03` and `01-02-03` (two of the formats OP want) gives an invalid date. – some Sep 03 '12 at 03:22
  • `var d = new Date('11.30.91'); console.log(d);` prints `Sat Nov 30 1991 00:00:00 GMT+0100 (CET)`, and `var d = new Date('11-30-91'); console.log(d);` prints `Sat Nov 30 1991 00:00:00 GMT+0100 (CET)` as well. Maybe it's *not cross-browser* :S – Kulik Sep 03 '12 at 05:54
  • [ECMAScript-262:5](http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf) 15.9.4.2: "The function first attempts to parse the format of the String according to the rules called out in Date Time String Format (15.9.1.15). If the String does not conform to that format the function may fall back to any implementation-specific heuristics or implementation-specific date formats. Unrecognisable Strings or dates containing illegal element values in the format String shall cause Date.parse to return NaN." – some Sep 03 '12 at 06:25
  • @some that opens up a floodgate of incompatible implementations, better not to use it outside the minimum supported formats...good call – Maarten Bodewes Sep 03 '12 at 22:29
  • @owlstead Exactly. It varies between different browser vendors but that is not enough, it also varies between different versions from the same vendor. That's why it is better to parse it by yourself if you know what format it is in. – some Sep 04 '12 at 09:05
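
One way to act on the comments above is to hand a string to the built-in parser only when it is in the ISO-style YYYY-MM-DD form that the specification defines, and to parse everything else by hand. The following is a minimal sketch under that assumption; `parseIsoDateOnly` is a hypothetical name, and the helper does not check that the day actually exists in the given month.

```javascript
// Hypothetical helper: trust the built-in parser only for YYYY-MM-DD,
// and return null for any other format instead of relying on
// implementation-specific heuristics.
function parseIsoDateOnly(str) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(str)) {
    return null; // not a format the specification guarantees
  }
  var time = Date.parse(str); // NaN if the engine rejects the string
  return isNaN(time) ? null : new Date(time);
}
```

Strings such as `11.30.91` are rejected up front here, so the result no longer depends on which browser or version runs the code.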