
Let's say I have a particular date like January 10, 2013.

I'd like to be able to search a text or html document to see if it contains a reference to that date. I'd like to account for the date being in any of a number of formats, for instance:

1/10/2013  
01/10/13  
2013-01-10
10-Jan-2013  
January 10, 2013  
Jan 10, 2013

... should all produce a positive match for January 10, 2013.

I recognize that swapping the day-month order could be problematic, but I am willing to accept false positives in this case, meaning:

01-10-2013
10-01-2013

... would both be acceptable for January 10, 2013 in my case.

Is there an established algorithm implemented in any language that performs this sort of generalized, but non-trivial, search? My preference would be something in Ruby or JavaScript, but I'd be interested in any well-considered example.

ADDENDUM #1

I see this code:

require 'time' # Time.strptime is defined in the 'time' standard library

def validate_date(date_str)
  valid_formats = ["%m/%d/%Y", "%m/%d/%Y %I:%M %P"] 
  #see http://www.ruby-doc.org/core-1.9.3/Time.html#method-i-strftime for more

  valid_formats.each do |format|
    valid = Time.strptime(date_str, format) rescue false

    return true if valid
  end

  return false
end

here.

... which would be a good way to handle numerical representations of dates. This leaves month names unaccounted for. With 1, 01, Jan, and January all representing the first month of the year, I am wondering whether the large number of permutations has already been handled well somewhere else.

Perry Horwich
  • If you know the date and enter it as 06/10/2013 would it not be possible to produce a regular expression from this data that includes all the options you require? – jing3142 Jun 10 '13 at 16:50
  • This does seem possible to me. Just wondering if someone has gone before me here. I expect I could come up with some patterns to use, but perhaps there is a more complete and well considered option already out there? – Perry Horwich Jun 10 '13 at 17:59
  • Can the down-voter explain their dissatisfaction? – Perry Horwich Jun 10 '13 at 18:03
  • I'm not the down-voter, but I would say you have expressed the problem well, but not given any indication of what you have done to help yourself. In this case I wouldn't expect to see your code, since you are looking for example or library. But you could explain where and how you have looked for examples so far. – Neil Slater Jun 10 '13 at 18:10
  • Thx Neil. My search included google and SO. Then I posted. There are some problems that feel like, "Someone must have had to do this before" I thought this might be one of those cases, but I found no specific example. Frankly, I am not sure how to condense the search terms here. Something like, "One-to-many date format matching" perhaps. I guess the downvote would be easier to swallow if the downvoter had also posted a link showing my question is a duplicate. – Perry Horwich Jun 10 '13 at 18:20
  • Added additional information to question – Perry Horwich Jun 10 '13 at 18:38
  • Do you remember programming language lexing and parsing? This is similar. Your task can be reduced to writing a lexical analyzer. Most tokens will be like (\w)+, but tokens for dates will be something like (\d\d)\s(\d\d)\s(\d\d\d\d). Just define a regular expression for each token (your dates). – akonsu Jun 10 '13 at 18:43

1 Answer


I'm not aware of any preexisting solutions for this, but it's not complicated to write your own. Make an array of the date formats you'd like to search for, then simply iterate over the formats, formatting your date and searching your document:

require 'date'

formats = ["%-m/%-d/%Y", # %-d is unpadded; %e would space-pad single-digit days
           "%m/%d/%Y",
           "%Y-%m-%d",
           "%d-%b-%Y",
           "%B %d, %Y",
           "%b %d, %Y"]

d = Date.new(2013, 1, 10)

formats.each do |format|
  search_string = d.strftime(format)
  # Do your search for `search_string`
end
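
As a minimal sketch of the search step itself, assuming the document is already loaded into a string, each formatted date can be tested with `String#include?`. The `mentions_date?` helper, the format list, and the sample text below are made up for illustration:

```ruby
require 'date'

# Hypothetical helper: true if any formatted rendering of `date` occurs in `text`.
def mentions_date?(text, date, formats)
  formats.any? { |format| text.include?(date.strftime(format)) }
end

# Illustrative format list; adjust to the formats you care about.
formats = ["%-m/%-d/%Y", "%m/%d/%Y", "%Y-%m-%d",
           "%d-%b-%Y", "%B %d, %Y", "%b %d, %Y"]

# Invented sample text standing in for the real document.
document = "The invoice was issued on 10-Jan-2013 and paid a week later."

puts mentions_date?(document, Date.new(2013, 1, 10), formats)
```

This scans the text once per format, which is simple but repeats work on large documents.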

Update: A somewhat more complicated, more efficient method would be to turn the search strings into a Regexp:

require 'date'

formats = ["%-m/%-d/%Y", # %-d is unpadded; %e would space-pad single-digit days
           "%m/%d/%Y",
           "%Y-%m-%d",
           "%d-%b-%Y",
           "%B %d, %Y",
           "%b %d, %Y"]

d = Date.new(2013, 1, 10)

# Regexp.union escapes plain strings itself, so no manual quoting is needed
regex = Regexp.union(formats.map { |f| d.strftime(f) })
# Search document for regex
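
To illustrate how such a combined pattern might be used, here is a sketch that scans a sample string once and lists every match. The `date_regex` helper and the `FORMATS` constant are hypothetical names, the sample text is invented, and the last three formats are my own additions to cover the two-digit-year and swapped day-month forms mentioned in the question:

```ruby
require 'date'

# Assumed format list; the final three entries are extras that are not
# in the list above (two-digit year, and both dash-separated orders).
FORMATS = ["%-m/%-d/%Y", "%m/%d/%Y", "%Y-%m-%d",
           "%d-%b-%Y", "%B %d, %Y", "%b %d, %Y",
           "%m/%d/%y", "%m-%d-%Y", "%d-%m-%Y"].freeze

# Hypothetical helper: one pattern that matches any rendering of `date`.
def date_regex(date, formats = FORMATS)
  Regexp.union(formats.map { |f| date.strftime(f) })
end

document = "Due 01/10/13, shipped 10-01-2013, confirmed January 10, 2013."

p document.scan(date_regex(Date.new(2013, 1, 10)))
# => ["01/10/13", "10-01-2013", "January 10, 2013"]
```

One caveat with plain substring patterns: `1/10/2013` also occurs inside `11/10/2013`, so if that kind of false positive matters, guarding the pattern with digit lookarounds such as `(?<!\d)` and `(?!\d)` is one option.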
Darshan Rivka Whittle
  • @akonsu Yes, it's inefficient, but it's very simple. Depending on the size of the documents to be searched, this may very well be good enough. Depending on that and other factors (is it necessary to search for multiple dates or just one?) it may be worth defining a concept of "date-like", scanning the document once for "date-like" things and only testing against them. – Darshan Rivka Whittle Jun 10 '13 at 20:06
  • @akonsu I just added another option that only scans the document once. – Darshan Rivka Whittle Jun 10 '13 at 20:33
  • Thanks. This has helped me. You also got me to go here: http://www.foragoodstrftime.com/ which helped me to understand your code better. Thanks again. – Perry Horwich Jun 10 '13 at 22:39
  • @PerryHorwich Happy to help. [The documentation](http://ruby-doc.org/stdlib-2.0/libdoc/date/rdoc/Date.html#method-i-strftime) includes all the format directives supported by Ruby. – Darshan Rivka Whittle Jun 10 '13 at 23:16