I am trying to extract dates from news and government announcement texts on Covid-19 in Hawaii that I have scraped. I ran a sample program on a dummy data set, and datefinder generates a date for every number on the page. When I use "strict=True", it finds no dates at all. Here are the results for a 4-line file.

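import datefinder

# Scan the file line by line and print every date datefinder reports
with open("c:/users/Lnitz/documents/ige2.txt") as file:
    for line in file:
        # source=True yields (datetime, matched_text) tuples
        matches = datefinder.find_dates(line, source=True)
        for match in matches:
            print(match, 'xxx', line)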

Result:

(datetime.datetime(2020, 11, 19, 0, 0), 'on Nov 19, 2020') xxx Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases  

(datetime.datetime(1998, 10, 24, 0, 0), '98') xxx Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases  

(datetime.datetime(2021, 10, 14, 0, 0), '14') xxx Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases  

(datetime.datetime(2021, 10, 19, 0, 0), '19') xxx Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi 

(datetime.datetime(1945, 3, 23, 0, 0), '3/23/1945') xxx Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi 

(datetime.datetime(1878, 3, 5, 0, 0), 'Mar 5,1878') xxx Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing

(datetime.datetime(1972, 10, 24, 0, 0), '72') xxx Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing

(datetime.datetime(1978, 10, 24, 0, 0), '78') xxx Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing
FObersteiner
  • I'm not sure I understand what the specific question is. Could you clarify by giving combinations of input and expected output? In general, I don't think it is simple to detect dates in arbitrary text snippets, so this will definitely require some tweaking, no matter which library you use. – FObersteiner Oct 25 '21 at 09:49
  • Thank you for the response. I have blocks of articles or opinion pieces on Hawaii's Covid response. Most contain a regular-format date, such as April 14, 2021, along with other non-date numbers. Without "strict", the algorithm produces a date for every extra number: 19 becomes 10-19-2021, taking the current month and year, and 78 becomes 10-26-1978, taking today's day and month. My desired output would be extraction of all month-day-year dates (in any format or order), with no dates generated from other one- or two-digit numbers that are not part of a date format. – Lawrence Nitz Oct 26 '21 at 20:23

1 Answer

If you set source=True, datefinder's output includes the matched source string, so you could post-process that. For example, a fully specified date (year, month, day) needs at least 6 characters (including separators) and at least 4 digits:

import datefinder

s = """Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases
Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi
Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing"""

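for line in s.split('\n'):
    matches = datefinder.find_dates(line, strict=False, source=True)
    for match in matches:
        # match is (datetime, source_text); keep it only if the source text
        # has at least 4 digits and at least 6 characters
        if sum(c.isdigit() for c in match[1]) >= 4 and len(match[1]) >= 6:
            print(f"{line} ->\n{match}\n")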

# Posted on Nov 19, 2020 in COVID-98 News Releases, Latest News, Press 14 Releases ->
# (datetime.datetime(2020, 11, 19, 0, 0), 'on Nov 19, 2020')

# Pre-travel COVID-19 testing results must be in hand prior to 3/23/1945 departure for Hawaiʻi ->
# (datetime.datetime(1945, 3, 23, 0, 0), '3/23/1945')

# Air Canada and WestJet Mar 5,1878 partnering 72 with State of Hawaiʻi for Canadian15 pre-travel 78 testing ->
# (datetime.datetime(1878, 3, 5, 0, 0), 'Mar 5,1878')
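
If the same check is needed in several places, it could be pulled into a small helper. The following is only a sketch of that idea: the name `looks_like_full_date` and its `min_digits` / `min_length` parameters are made up for illustration and are not part of datefinder. To keep the snippet runnable without datefinder installed, the (datetime, source) pairs are copied from the question's output for the first line rather than produced by `find_dates`.

```python
from datetime import datetime


def looks_like_full_date(source: str, min_digits: int = 4, min_length: int = 6) -> bool:
    """Heuristic: True if the matched text has enough digits and characters
    to plausibly be a complete date (hypothetical helper, not a datefinder API)."""
    digit_count = sum(ch.isdigit() for ch in source)
    return digit_count >= min_digits and len(source) >= min_length


# (datetime, source) pairs copied from the question's output for the first line
matches = [
    (datetime(2020, 11, 19), "on Nov 19, 2020"),
    (datetime(1998, 10, 24), "98"),
    (datetime(2021, 10, 14), "14"),
]

# Keep only the matches whose source text passes the heuristic
kept = [m for m in matches if looks_like_full_date(m[1])]
print(kept)
# [(datetime.datetime(2020, 11, 19, 0, 0), 'on Nov 19, 2020')]
```

Note that this remains a heuristic: a fragment such as 'on 2020' would also pass the check, so the thresholds may need tuning for your texts.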
FObersteiner