For an NLP project in Python, I need to generate random dates for model training. In particular, the date format must be random and consistent with a set of language locales. The formats include purely numeric ones as well as formats with (partially) written-out day and month names, and various common punctuation marks.

My best solution so far is the following algorithm:

  1. Generate a datetime object with random values (nice solution here).
  2. Randomly select a locale, i.e. pick one of ['en_US','fr_FR','it_IT','de_DE']. In my case this list is short and known in advance, so this is not a problem.
  3. Randomly select a format string for strftime(), i.e. one of ['%Y-%m-%d','%d %B %Y',...]. In my case the list should reflect the date formats likely to occur in the documents that the NLP model will see in the future.
  4. Generate a string with strftime().

For step 3 in particular, I do not know a better approach than hardcoding the list of formats I have seen manually in the training documents. I have not yet found a function that turns OCR'd dates into a format string, which would let me extend the list whenever a previously unseen date format comes up.

Do you have any suggestions on how to come up with better randomly formatted dates, or how to improve this approach?

desertnaut
Dr-Nuke

2 Answers

USE random.randrange() AND datetime.timedelta() TO GENERATE A RANDOM DATE BETWEEN TWO DATES

Call datetime.date(year, month, day) to create a date object for the given year, month, and day. Do this twice to define the start and end dates. Subtract the start date from the end date to get a datetime.timedelta spanning the two dates, and read its days attribute to get the number of days between them. Call random.randrange(days) to get a random integer from 0 up to, but not including, days. Call datetime.timedelta(days=n) to build a timedelta of n days, and add it to the start date. Note that the end date itself is never returned.

    import datetime
    import random

    start_date = datetime.date(2020, 1, 1)
    end_date = datetime.date(2020, 2, 1)

    # Number of whole days between the two dates
    time_between_dates = end_date - start_date
    days_between_dates = time_between_dates.days

    # randrange(n) returns 0 .. n-1, so end_date itself is never picked
    random_number_of_days = random.randrange(days_between_dates)
    random_date = start_date + datetime.timedelta(days=random_number_of_days)

    print(random_date)
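
If many dates are needed, the steps above could be wrapped in a small helper. This is only a sketch: `random_date` and its `rng` parameter are hypothetical names that are not part of the original answer, and passing a seeded `random.Random` instance assumes that reproducible training data is wanted.

```python
import datetime
import random


def random_date(start, end, rng=random):
    """Return a random date d with start <= d < end (hypothetical helper)."""
    days_between = (end - start).days
    if days_between <= 0:
        raise ValueError("end must be later than start")
    # randrange(n) yields 0 .. n-1, so `end` itself is never returned.
    return start + datetime.timedelta(days=rng.randrange(days_between))


# A seeded generator makes the generated dates reproducible across runs.
rng = random.Random(42)
sample = [
    random_date(datetime.date(2020, 1, 1), datetime.date(2020, 2, 1), rng)
    for _ in range(5)
]
print(sample)
```

Leaving `rng` at its default falls back to the module-level functions of `random`, which matches the behaviour of the snippet above.
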
theroyakash

Here is my solution. Note that every locale in the list must be installed on your computer; otherwise locale.setlocale() raises an error.

    import locale
    import random
    from datetime import datetime

    # Every locale listed here must be installed on your computer,
    # otherwise locale.setlocale() raises locale.Error.
    LOCALE = ['en_US', 'fr_FR', 'it_IT', 'de_DE']
    DATE_FORMAT = ['%Y-%m-%d', '%d %B %Y']

    def gen_datetime(min_year=1900, max_year=datetime.now().year):
        # Pick a random instant between Jan 1 of min_year and the end of max_year.
        start = datetime(min_year, 1, 1)
        end = datetime(max_year + 1, 1, 1)  # exact bound, so leap days are covered
        random_dt = start + (end - start) * random.random()

        # Pick a random format string and a random locale.
        format_date = random.choice(DATE_FORMAT)
        locale_date = random.choice(LOCALE)
        locale.setlocale(locale.LC_ALL, locale_date)  # raises locale.Error if the locale is missing

        return random_dt.strftime(format_date)

    date = gen_datetime()

    print(date)
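
Since a missing locale makes `locale.setlocale()` raise `locale.Error`, it may help to filter the list up front. The following is a minimal sketch and not part of the original solution: `available_locales` is a hypothetical helper, and it assumes that probing the `LC_TIME` category alone is enough for `strftime()`. On some systems the locale names may also need an encoding suffix such as `en_US.UTF-8`.

```python
import locale


def available_locales(candidates):
    """Return the candidates that setlocale() accepts on this machine."""
    saved = locale.setlocale(locale.LC_TIME)  # query the current setting
    usable = []
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_TIME, name)
                usable.append(name)
            except locale.Error:
                pass  # not installed here, skip it
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore the previous setting
    return usable


print(available_locales(['en_US', 'fr_FR', 'it_IT', 'de_DE']))
```

The returned list could then be used in place of the hardcoded one, so that the random choice only ever picks a locale that is actually installed.
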
Inadel