2

I know there are many Q&As about extracting a datetime from a string, for example with dateutil.parser:

import dateutil.parser as dparser
dparser.parse('something sep 28 2017 something',fuzzy=True).date()

output: datetime.date(2017, 9, 28)

My question is how to find out which part of the string produced this result. For example, I want a function that also returns 'sep 28 2017':

datetime, datetime_str = get_date_str('something sep 28 2017 something')
outputs: datetime.date(2017, 9, 28), 'sep 28 2017'

Any clues or directions I could look into?

FF0605
  • You can use `strftime` to convert datetime object to string (https://stackoverflow.com/questions/2158347/how-do-i-turn-a-python-datetime-into-a-string-with-readable-format-date/22842734) – Sruthi Dec 04 '18 at 03:24
  • I don't think this is particularly simple to do, unfortunately. The string that dateutil found is not guaranteed to be contiguous, so "It happened on Sep 29, 1947 at around 8:45" will return `1947-09-29 08:45`. That said, `fuzzy_with_tokens` is a good place to start. It gives you the inverse of what you want. – Paul Dec 04 '18 at 03:31
  • @Paul thanks, I was also thinking of `fuzzy_with_tokens`, which lists the tokens to exclude, but the list includes a space, `('something ', ' ', 'something')`, which makes it a bit challenging. I will try to work something out in this direction and post my solution later. – FF0605 Dec 04 '18 at 03:37
  • @SruthiV thanks, but I'm looking for the original substring rather than converting a datetime to a string; the datetime package already handles that conversion simply by appending `.strftime("%Y-%m-%d")`. – FF0605 Dec 04 '18 at 03:40

2 Answers

2

Extending the discussion with @Paul and building on the solution from @alecxe, I propose the following solution, which works on a number of test cases. I've made the problem slightly more challenging:

Step 1: get excluded tokens

import dateutil.parser as dparser

ostr = 'something sep 28 2017 something abcd'
_, excl_str = dparser.parse(ostr,fuzzy_with_tokens=True)

which gives:

excl_str:     ('something ', ' ', 'something abcd')

Step 2: sort tokens by length, longest first

excl_str = list(excl_str)
excl_str.sort(reverse=True,key = len)

gives a sorted token list:

excl_str:   ['something abcd', 'something ', ' ']

Step 3: delete the tokens, ignoring the single-space element

for i in excl_str:
    if i != ' ':  # leave single spaces alone; they also occur inside the date
        ostr = ostr.replace(i, '')

gives the final output:

ostr:    'sep 28 2017 '

Note: step 2 is required because a shorter token may be a substring of a longer one. In this case, if deletion followed the order ('something ', ' ', 'something abcd'), replacing 'something ' first would also strip it from 'something abcd', so 'abcd' would never be deleted and the result would be 'sep 28 2017 abcd'.
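The three steps can be collected into one helper, which also makes the ordering issue easy to reproduce. This is only a minimal sketch: `strip_skipped_tokens` is a hypothetical name, not part of dateutil, and the token tuple is hard-coded to the value shown in step 1 instead of coming from a live `dparser.parse(..., fuzzy_with_tokens=True)` call. It also skips every whitespace-only token, which is slightly broader than the `i != ' '` check above.

```python
def strip_skipped_tokens(text, skipped, longest_first=True):
    """Delete every skipped token from text, ignoring whitespace-only tokens.

    Hypothetical helper: `skipped` stands for the tuple that dateutil
    returns next to the datetime when fuzzy_with_tokens=True is used.
    """
    tokens = [t for t in skipped if t.strip()]  # drop ' ' and other blank tokens
    if longest_first:
        tokens.sort(key=len, reverse=True)  # step 2: longest tokens first
    for token in tokens:
        text = text.replace(token, '')  # step 3: remove all occurrences
    return text


ostr = 'something sep 28 2017 something abcd'
# Assumed token tuple, copied from the output shown in step 1.
excl_str = ('something ', ' ', 'something abcd')

sorted_result = strip_skipped_tokens(ostr, excl_str)
unsorted_result = strip_skipped_tokens(ostr, excl_str, longest_first=False)
print(repr(sorted_result))    # 'sep 28 2017 '
print(repr(unsorted_result))  # 'sep 28 2017 abcd'
```

Calling `.strip()` on the result would drop the trailing space if it is not wanted.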

FF0605
1

Interesting problem! There is no direct way to get the parsed date substring out of the larger string with dateutil. The problem is that the dateutil parser never has this substring available as an intermediate result, because it builds the parts of the future datetime object on the fly, character by character (source).

It does, however, collect a list of skipped tokens, which is probably your best bet. As this list is ordered, you can loop over the tokens and remove the first occurrence of each one:

from dateutil import parser


s = 'something sep 28 2017 something'
parsed_datetime, tokens = parser.parse(s, fuzzy_with_tokens=True)

for token in tokens:
    s = s.replace(token.lstrip(), "", 1)

print(s.strip())  # prints "sep 28 2017"

I am not 100% sure this works in all possible cases, though, especially with different whitespace characters (notice how I had to work around that with .lstrip()).
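One way to reduce the dependence on `.lstrip()` is to skip whitespace-only tokens and locate the remaining ones by position, scanning left to right. The sketch below is an illustration under assumptions, not a tested recipe for every input: `extract_unskipped` is a hypothetical name, and the token tuples are hard-coded (the first is the one quoted in the comments on the question, the second is a made-up variant with a leading space) rather than taken from a live dateutil call.

```python
def extract_unskipped(text, skipped):
    """Return the parts of text not covered by the skipped tokens.

    Hypothetical helper. Tokens are matched in order, each one searched
    for at or after the end of the previous match, so a repeated word is
    only removed where it was actually skipped.
    """
    kept = []
    cursor = 0
    for token in skipped:
        if not token.strip():
            continue  # whitespace-only tokens also occur inside the date
        pos = text.find(token, cursor)
        if pos == -1:
            continue  # token not found after the cursor; leave text as is
        kept.append(text[cursor:pos])  # keep whatever lies before the token
        cursor = pos + len(token)
    kept.append(text[cursor:])  # keep the tail after the last token
    return "".join(kept).strip()


s = 'something sep 28 2017 something'
tokens = ('something ', ' ', 'something')         # as quoted in the comments
tokens_variant = ('something ', ' ', ' something')  # assumed alternative shape

print(extract_unskipped(s, tokens))          # sep 28 2017
print(extract_unskipped(s, tokens_variant))  # sep 28 2017
```

If the skipped words split the date into several pieces, the pieces are simply concatenated, so the result is not guaranteed to be a contiguous substring of the input.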

alecxe