Given a string with a date in an unknown format and other text, how can I separate the two?

>>> import dateutil.parser as dparser
>>> dparser.parse("monkey 2010-07-10 love banana", fuzzy=True)
datetime.datetime(2010, 7, 10, 0, 0)

That snippet, from "Extracting date from a string in Python", is a step in the right direction, but what I want is the non-date text as well. For example:

date = 2010-07-10
str_a = 'monkey', str_b = 'love banana'

If the date string didn't have spaces in it, I could split the string and test each substring, but how about 'monkey Feb 20, 2015 loves 2014 bananas'? 2014 and 2015 would both "pass" parse(), but only one of them is part of a date.

EDIT: there doesn't seem to be any reasonable way to deal with 'monkey Feb 20, 2015 loves 2014 bananas'. That leaves 'monkey Feb 20, 2015 loves bananas' or 'monkey 2/20/2015 loves bananas' or 'monkey 20 Feb 2015 loves 2014 bananas' or other variants as things parse() can deal with.

foosion
  • Why is 2015 a year in your example while 2014 is not? The phrase is nonsense either way. – jfs Feb 21 '15 at 13:05
  • Fair point. Feb 20, 2015 is clearly a date, while 2014 is ambiguous. If you run it through parse(...,fuzzy=True), it considers 2014 hours and minutes. I'll edit the question. – foosion Feb 21 '15 at 13:16
  • Perhaps I should examine the source for parse(). – foosion Feb 21 '15 at 13:21
  • I'd start by trying a date parse at each offset. If just one works, use that; if two or more offsets work, then you have a new problem. – Skaperen Feb 21 '15 at 13:23
  • @Skaperen split on spaces and consider any block that "passes" parse() as a date? Or do you mean something else? BTW, for `Feb 20, 2015` each offset would pass, but the parts that work would be contiguous. – foosion Feb 21 '15 at 13:49
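
The idea discussed in these comments (split on spaces, try a parse at each offset, prefer contiguous blocks) might be sketched roughly as follows. This is only an illustration under assumptions: `FORMATS`, `try_parse` and `split_on_date` are hypothetical names, and a fixed list of `strptime` formats stands in for `dateutil`'s parser so the snippet needs only the standard library. Month names are matched according to the current locale.

```python
from datetime import datetime

# Hypothetical format list (an assumption); extend it for the inputs you expect.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y", "%d %b %Y", "%B %d, %Y", "%d %B %Y")

def try_parse(candidate):
    """Return a datetime if candidate matches one of FORMATS, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue
    return None

def split_on_date(text, parse=try_parse):
    """Return (date, text_before, text_after) for the widest token window that parses."""
    tokens = text.split()
    # Try the widest windows first so 'Feb 20, 2015' beats any shorter match.
    for width in range(len(tokens), 0, -1):
        for start in range(len(tokens) - width + 1):
            parsed = parse(" ".join(tokens[start:start + width]))
            if parsed is not None:
                before = " ".join(tokens[:start])
                after = " ".join(tokens[start + width:])
                return parsed, before, after
    return None, text, ""

print(split_on_date("monkey 2010-07-10 love banana"))
# (datetime.datetime(2010, 7, 10, 0, 0), 'monkey', 'love banana')
print(split_on_date("monkey Feb 20, 2015 loves 2014 bananas"))
# (datetime.datetime(2015, 2, 20, 0, 0), 'monkey', 'loves 2014 bananas')
```

The `parse` argument is there so a different parser can be swapped in, for instance a small wrapper around `dateutil.parser.parse` that returns `None` on failure. A more permissive parser would also accept a lone `2014`, which is why the widest window is tried first.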

2 Answers

You can use a regex to extract the words, and to get rid of month names you can check that each word is not in calendar.month_abbr or calendar.month_name:

>>> import calendar, re
>>> def word_find(s):
...     # Keep alphabetic words that are not full or abbreviated month names.
...     return [i for i in re.findall(r'[a-zA-Z]+', s)
...             if i.capitalize() not in calendar.month_name
...             and i.capitalize() not in calendar.month_abbr]

Demo:

>>> s1='monkey Feb 20, 2015 loves 2014 bananas'
>>> s2='monkey Feb 20, 2015 loves bananas'
>>> s3='monkey 2/20/2015 loves bananas'
>>> s4='monkey 20 Feb 2015 loves 2014 bananas'
>>> print(word_find(s1))
['monkey', 'loves', 'bananas']
>>> print(word_find(s2))
['monkey', 'loves', 'bananas']
>>> print(word_find(s3))
['monkey', 'loves', 'bananas']
>>> print(word_find(s4))
['monkey', 'loves', 'bananas']

It also handles full month names:

>>> s5='monkey 20 January 2015 loves 2014 bananas'
>>> print(word_find(s5))
['monkey', 'loves', 'bananas']
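
One caveat worth illustrating: because the filter looks at words in isolation, an ordinary word that happens to be a month name (such as "may") is dropped too, and anything containing digits never survives the regex. A minimal sketch, repeating `word_find` so it runs on its own as a script:

```python
import calendar
import re

def word_find(s):
    # Same filter as above: alphabetic words minus full/abbreviated month names.
    return [w for w in re.findall(r'[a-zA-Z]+', s)
            if w.capitalize() not in calendar.month_name
            and w.capitalize() not in calendar.month_abbr]

# "may" is used as a verb here, but it is filtered out like a month name,
# and the standalone number 3 is discarded along with date digits.
print(word_find('monkeys may eat 3 bananas'))  # ['monkeys', 'eat', 'bananas']
```
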
Mazdak
  • Consider Feb 20, 2015 or 20 February 2015. I could have a list of all full and abbreviated date strings, but that's tedious (and is "may" a date or not?), especially when parse() can recognize dates. – foosion Feb 21 '15 at 13:10
  • Kasra that's worked on everything I've tried so far. – foosion Feb 21 '15 at 14:31
  • Kasra, sorry, was doing a bit more testing, then got distracted trying to again figure out why @ shows up in your response to me and I can't get it in my response to you. – foosion Feb 21 '15 at 14:43
  • @foosion ;) it's OK! Because this is my answer, if you leave a comment here I'll get a notification anyway, so there is no need to @! – Mazdak Feb 21 '15 at 14:45
  • If I understand correctly, I see @ for your answers to me, but others don't. – foosion Feb 21 '15 at 14:47
  • @foosion no ... J.F. Sebastian uses @ in comment! – Mazdak Feb 21 '15 at 14:48
  • Also, you shouldn't define a function using the name of a built-in feature :-) – foosion Feb 21 '15 at 14:50
  • @foosion surely yes!;) – Mazdak Feb 21 '15 at 14:52

To find dates and times in natural-language text together with their positions in the input, which in turn lets you extract the non-date text, you can use parsedatetime:

 #!/usr/bin/env python
 import parsedatetime # $ pip install parsedatetime

 cal = parsedatetime.Calendar()
 for text in ['monkey 2010-07-10 love banana',
              'monkey Feb 20, 2015 loves 2014 bananas']:
     indices = [0]
     for parsed_datetime, type, start, end, matched_text in cal.nlp(text) or []:
         indices.extend((start, end))
         print([parsed_datetime, matched_text])
     indices.append(len(text))
     print([text[i:j] for i, j in zip(indices[::2], indices[1::2])])

Output

[datetime.datetime(2015, 2, 21, 20, 10), '2010']
['monkey ', '-07-10 love banana']
[datetime.datetime(2015, 2, 20, 0, 0), ' Feb 20, 2015']
[datetime.datetime(2015, 2, 21, 20, 14), '2014']
['monkey', ' loves ', ' bananas']

Note: parsedatetime failed to recognize 2010-07-10 as a date in the first string. Instead, 2010 (first string) and 2014 (second string) are each recognized as a time (20:10 and 20:14, respectively).
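
If ISO-style dates such as 2010-07-10 matter for your input, one possible workaround is to look for them with a regex first and slice the text by match position, the same way the loop above uses `start` and `end`. This is a sketch under assumptions, not part of parsedatetime: `ISO_DATE` and `split_iso_dates` are hypothetical names, and only the `YYYY-MM-DD` shape is handled.

```python
import re
from datetime import datetime

# Assumed pattern: four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def split_iso_dates(text):
    """Return (dates, non_date_pieces), slicing text at each valid match."""
    dates, pieces, last = [], [], 0
    for m in ISO_DATE.finditer(text):
        try:
            parsed = datetime.strptime(m.group(), "%Y-%m-%d")
        except ValueError:
            continue  # looks like a date but is not valid, e.g. 2010-13-40
        dates.append(parsed)
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return dates, pieces

print(split_iso_dates("monkey 2010-07-10 love banana"))
# ([datetime.datetime(2010, 7, 10, 0, 0)], ['monkey ', ' love banana'])
```

Whatever text remains could then be handed to `cal.nlp()` for the human-readable forms.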

jfs
  • Doesn't 'failed to recognize' mean `parsedatetime` is not as good at recognizing valid date strings as `dateutil.parser.parse`? – foosion Feb 21 '15 at 13:46
  • @foosion: it depends on the input. It may be better at parsing human-readable date/time strings e.g., `cal.nlp('tomorrow')` works but `dateutil.parser.parse('tomorrow', fuzzy=True)` returns the default (wrong date). – jfs Feb 21 '15 at 13:51