For an NLP project in Python, I need to generate random dates for model training. In particular, the date format must be random and consistent with a set of language locales. The formats include purely numeric ones as well as ones with (partially) written-out day and month names, and various common punctuation.
My best solution so far is the following algorithm:
1. Generate a datetime() object with random values (nice solution here).
2. Randomly select a locale, i.e. pick one of ['en_US','fr_FR','it_IT','de_DE']. In this case the list is well known and short, so this is not a problem.
3. Randomly select a format string for strftime(), i.e. one of ['%Y-%m-%d','%d %B %Y',...]. In my case the list should reflect the date formats that may occur in the documents the NLP model will be exposed to in the future.
4. Generate a string with strftime().
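For reference, the steps above can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the helper names `random_datetime` and `random_date_string` are made up for illustration, the date range and the last two format strings are placeholder assumptions, and appending `.UTF-8` to the locale name assumes POSIX-style locales that are installed on the system. If a locale is missing, the sketch falls back to whatever locale is currently active.

```python
import locale
import random
from datetime import datetime, timedelta

LOCALES = ['en_US', 'fr_FR', 'it_IT', 'de_DE']
# Only the first two formats come from the list above; the rest are
# hypothetical examples standing in for the hardcoded list.
FORMATS = ['%Y-%m-%d', '%d %B %Y', '%d.%m.%Y', '%d %b %Y']


def random_datetime(rng, start=datetime(1970, 1, 1), end=datetime(2030, 1, 1)):
    """Step 1: pick a uniformly random second in [start, end)."""
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randrange(span_seconds))


def random_date_string(rng):
    """Steps 2-4: pick a locale and a format, then render with strftime()."""
    dt = random_datetime(rng)
    loc = rng.choice(LOCALES)
    fmt = rng.choice(FORMATS)

    # setlocale() changes process-wide state, so remember the current
    # LC_TIME setting and restore it afterwards.
    previous = locale.setlocale(locale.LC_TIME)
    try:
        try:
            # Assumption: POSIX-style locale names with a UTF-8 variant.
            locale.setlocale(locale.LC_TIME, loc + '.UTF-8')
        except locale.Error:
            loc = None  # locale not installed; keep the active one
        text = dt.strftime(fmt)
    finally:
        locale.setlocale(locale.LC_TIME, previous)
    return text, fmt, loc


if __name__ == '__main__':
    rng = random.Random()
    for _ in range(5):
        print(random_date_string(rng))
```

Returning the chosen format and locale together with the string keeps the label available for training. Note that `locale.setlocale()` is not thread-safe, so this sketch is only suitable for single-threaded generation.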
Especially for 3), I do not know a better approach than hardcoding the list of formats I saw manually in the training documents. I have not yet found a function that would turn OCR'd dates into a format string, so that I could extend the list when previously unseen date formats come up.
Do you have any suggestions on how to come up with better randomly formatted dates, or how to improve this approach?