
I have a series of text blocks that contain a date written as "The first Wednesday of September, 2021" or "The third Monday in July, 2022", etc. I am not sure of the best way to extract the text and reformat it as a standard 'Month Day, Year' format. I have tried using the datefinder library with fuzzy matching on, but 'first Tuesday' and others have failed, I believe because it isn't a normal date format. Any ideas would be greatly appreciated, thanks all!

pbthehuman
  • You will first have to parse the input yourself to split out the day, month and year. After that you can use, for example, datetime to create date objects for further use. If all input dates follow the format you describe here, parsing them should be pretty trivial. – binaryescape Aug 01 '23 at 19:04
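
The approach described in this comment could be sketched roughly as follows: split the phrase with a regular expression, then let datetime do the arithmetic. The names `ORDINALS`, `WEEKDAYS`, `PHRASE`, `nth_weekday_date` and `format_date` are made up for illustration, and the sketch assumes English month and weekday names (the default C locale).

```python
import re
from datetime import date, datetime

# Hypothetical lookup tables; extend them if other wordings appear.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]  # same order as date.weekday()

# Assumed phrase shape: "The <ordinal> <weekday> of|in <Month>, <Year>"
PHRASE = re.compile(r"The (\w+) (\w+) (?:of|in) (\w+), (\d{4})", re.IGNORECASE)


def nth_weekday_date(phrase):
    """Return a datetime.date for the phrase, or None if it cannot be resolved."""
    m = PHRASE.search(phrase)
    if m is None:
        return None
    ordinal, weekday, month_name, year = m.groups()
    n = ORDINALS.get(ordinal.lower())
    if n is None or weekday.lower() not in WEEKDAYS:
        return None
    try:
        # %B parses a full month name; this depends on the active locale.
        first = datetime.strptime(f"1 {month_name} {year}", "%d %B %Y").date()
    except ValueError:
        return None
    # Days from the 1st of the month to the first requested weekday.
    offset = (WEEKDAYS.index(weekday.lower()) - first.weekday()) % 7
    day = 1 + offset + 7 * (n - 1)
    try:
        return date(first.year, first.month, day)
    except ValueError:
        return None  # e.g. the month has no fifth Monday


def format_date(d):
    """Format as 'Month Day, Year' without a leading zero on the day."""
    return f"{d:%B} {d.day}, {d.year}"


for s in ["The first Wednesday of September, 2021",
          "The third Monday in July, 2022"]:
    print(format_date(nth_weekday_date(s)))
```

With the two sample phrases from the question this prints "September 1, 2021" and "July 18, 2022". Returning a `date` object rather than a string keeps the result usable for comparisons or further arithmetic.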

1 Answer


This assumes every date in the text follows the format "The <ordinal> <day_of_week> of <Month>, <Year>" (so you have to replace "in" with "of" in the second date):

import calendar
import re

text = [
    "The first Wednesday of September, 2021",
    "The third Monday of July, 2022",
    # more dates
]

pattern = r"The (\w+) (\w+) of (\w+), (\d{4})"

cardinal = {
    "first": 1,
    "second": 2,
    "third": 3,
    "fourth": 4,
    "fifth": 5
}


def find_nth_day_of_week(year_str, month_name, day_of_week, n_str):
    year = int(year_str)

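    # list.index raises ValueError for an unknown name, so catch it;
    # index 0 is the empty placeholder entry in calendar.month_name.
    try:
        month = list(calendar.month_name).index(month_name.capitalize())
    except ValueError:
        return None
    if month == 0:
        return None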

    n = cardinal.get(n_str.lower())
    if n is None:
        return None

    cal = calendar.monthcalendar(year, month)

    # An unknown weekday name also raises ValueError.
    try:
        day_index = list(calendar.day_name).index(day_of_week.capitalize())
    except ValueError:
        return None

    nth_occurrence = [week[day_index] for week in cal if week[day_index] != 0]
    if n > len(nth_occurrence):
        return None

    day = nth_occurrence[n - 1]
    date = f"{calendar.month_abbr[month]} {day}, {year}"
    return date


def parse_text(text):
    match = re.match(pattern, text)
    if match:
        # Use n_str so the local name does not shadow the global cardinal dict.
        n_str, day_of_week, month, year = match.groups()
        return find_nth_day_of_week(year, month, day_of_week, n_str)
    return None


dates = [parse_text(block) for block in text]

for i, date in enumerate(dates):
    print(f"Date {i + 1}: {date}")
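
Since the question mentions text blocks that contain a date, and `re.match` only matches at the very start of a string, a variant that rewrites the phrase wherever it occurs might look like the sketch below. It is only an illustration under a few assumptions: `PHRASE`, `_reformat` and `rewrite_dates` are hypothetical names, both "of" and "in" are accepted, month and weekday names are English, and phrases that cannot be resolved are left unchanged.

```python
import calendar
import re

ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}
# Name -> number lookups built from the calendar module (locale-dependent).
MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}
DAYS = {name.lower(): i for i, name in enumerate(calendar.day_name)}

# Assumed phrase shape, allowed anywhere in the text.
PHRASE = re.compile(r"\b[Tt]he (\w+) (\w+) (?:of|in) (\w+), (\d{4})")


def _reformat(match):
    """Replacement callback: return 'Month Day, Year' or the original text."""
    ordinal, day_name, month_name, year = match.groups()
    n = ORDINALS.get(ordinal.lower())
    month = MONTHS.get(month_name.lower())
    weekday = DAYS.get(day_name.lower())
    if n is None or month is None or weekday is None:
        return match.group(0)
    # Every day of the month that falls on the requested weekday.
    days = [week[weekday]
            for week in calendar.monthcalendar(int(year), month)
            if week[weekday]]
    if n > len(days):
        return match.group(0)  # e.g. no fifth Monday in that month
    return f"{calendar.month_name[month]} {days[n - 1]}, {year}"


def rewrite_dates(block):
    """Rewrite every recognised phrase inside a larger block of text."""
    return PHRASE.sub(_reformat, block)


print(rewrite_dates("The board meets on the third Monday in July, 2022 at noon."))
```

This prints "The board meets on July 18, 2022 at noon.". Using `re.sub` with a callback keeps the surrounding text intact, which may or may not be what you want; swap in `PHRASE.finditer` if you only need the extracted dates.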
Byte Ninja