I tried the dateutil package to extract the date part from a string. It works well if an exact date is included in the string, for example:

from dateutil.parser import parse
try: 
    date = parse(string, fuzzy=True)
    print(str(date)[:10])
except ValueError:
    print("no date in text")

string = "an example of date:8 march 2019"
output: 2019-03-08

string = "an example of date: 2019/3/8"
output: 2019-03-08

string = "an example of pure string"
output: no date in text

But when the string contains a bare number instead of a date, it goes wrong and treats the number as a date:

string = "an example of wrong date: 8"

output: 2022-03-08

My question is: how can I use this package, or a similar one, to solve this problem? There are some related posts about extracting dates, such as "Extract date from string in python", but they do not cover this case and only work for a specific date format.

Your help is much appreciated!

ctrl-alt-delor
Sam S.

1 Answer

It seems that you want to exploit the dateutil module's powerful ability to parse free-form dates, but the default range of inputs it attempts to parse and its default normalization rules (filling in the current month/year when they are missing from the date) are not what you need.
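To make that normalization visible, here is a small sketch that passes an explicit `default` to `parse` instead of relying on today's date. `FIXED_DEFAULT` is a name made up for this illustration; any fixed datetime would do.

```python
from datetime import datetime
from dateutil.parser import parse

# Hypothetical fixed reference date, used instead of "today" so the
# filled-in fields are easy to spot.
FIXED_DEFAULT = datetime(2000, 1, 1)

# Only a day number is present: month and year are taken from the default.
lone_number = parse("an example of wrong date: 8", fuzzy=True, default=FIXED_DEFAULT)

# A complete date is present: nothing is taken from the default.
full_date = parse("an example of date:8 march 2019", fuzzy=True, default=FIXED_DEFAULT)

print(lone_number.date())  # 2000-01-08
print(full_date.date())    # 2019-03-08
```

The lone "8" becomes the 8th of whatever month and year the default supplies, which is why the output in the question depends on the day the code was run.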

One thing you can do is skip parsing the value with dateutil when it is parseable as an integer or when the string contains no digit at all.

So my suggestion is to check these two preconditions first (you can extend the list to eliminate other default misinterpretations by dateutil in your case):

import re
from dateutil.parser import parse

try:
    int(string)
    print("Seems like integer.")
except ValueError:  # the string must not parse as a plain integer
    if re.search(r'\d', string) is not None:  # the string must contain a digit
        try:
            date = parse(string, fuzzy=True)
            print(str(date)[:10])
        except ValueError:
            print("no date in text")
    else:
        print("Can't parse")
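Note that a string such as "an example of wrong date: 8" passes both preconditions (it is not a plain integer and it contains a digit), so it is still parsed as a date. One possible extension, offered only as a sketch and not as part of the code above, is to parse the text twice with two different `default` values and accept the result only when both parses agree. The function name `extract_full_date` and the two reference dates are assumptions made for this illustration.

```python
from datetime import datetime
from dateutil.parser import parse

def extract_full_date(text):
    """Return a date only if day, month and year all come from the text.

    Hypothetical helper: parses twice with defaults that differ in year,
    month and day. Any field dateutil fills in from the default will
    differ between the two results.
    """
    first_default = datetime(2000, 1, 1)
    second_default = datetime(2001, 2, 2)
    try:
        a = parse(text, fuzzy=True, default=first_default)
        b = parse(text, fuzzy=True, default=second_default)
    except (ValueError, OverflowError):
        # No date found, or the numbers cannot form a valid date.
        return None
    if a.date() != b.date():
        # At least one of day/month/year was borrowed from the default.
        return None
    return a.date()

for text in ["an example of date:8 march 2019",
             "an example of wrong date: 8",
             "today 8 nov 2022",
             "an example of pure string"]:
    print(repr(text), "->", extract_full_date(text))
```

With this check, a lone number or a partial date such as "march 2019" is rejected, while strings containing a complete date are still parsed. Whether partial dates should be rejected is a design choice that depends on your data.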
sophros
  • Thanks Sophros for the answer, however it looks like it does not produce any output, please try for string = "today 8" or "today 8 nov 2022" . The first one is just an integer and the second one is a date. – Sam S. Dec 11 '22 at 23:58
  • This is not true. "today 8" - it is a string and an integer. – sophros Dec 15 '22 at 22:09
  • I mean your code does not give any output/result for any input; I tried it in jupyter notebook. Could you please test your code for some input like "today 8" or "today 8 nov 2022" or those input examples mentioned in the original question? – Sam S. Dec 17 '22 at 10:54
  • Indeed this was an obvious misuse in the `re.search` - I swapped the pattern and the string to search for the pattern in the arguments. Corrected. – sophros Dec 18 '22 at 10:09