I am using PrettyTime NLP to find dates in a list of strings.

Example

ABC High School March 5, 2016
XYZ High School 08/20/2016 Gym

When I parse using PrettyTimeNLP, it gives me a list of dates in this format.

Sat Aug 20 10:05:27 EDT 2016

Is it possible to parse the string and then split it before and after the date, so that I end up with:

string1 = 'XYZ High School'
string2 = '08/20/2016'
string3 = 'Gym'

I know I can use RegEx to do the job but the example here is a simple one. My document will be 1-10 pages long and contain various formats of dates.

Any examples of how to manipulate PrettyTime will be appreciated.

monty_bean
  • Without a delimiter, or fixed width fields, or using regex to handle all the expected date formats, how can you tell where in the string the date begins and ends? – Emmanuel Rosa Mar 18 '16 at 00:32
  • @EmmanuelRosa, Yeah... I was hoping that since PrettyTime NLP already recognizes natural language dates, there might be a way to get the matched text and the rest of the line. I tried to decipher the code, but I'm not an expert. I was wrestling with the idea yesterday, and I guess I will use PrettyTime NLP to recognize dates and then use RegEx to extract the dates and the rest of the line. Thank you Emmanuel. – monty_bean Mar 18 '16 at 15:00

1 Answer

The DateGroup provided by PrettyTimeParser.parseSyntax() contains some of the information needed to answer your question. The rest of the information can be determined from the original text.

@GrabResolver(name='sonatype-snapshots', root='https://oss.sonatype.org/content/repositories/snapshots/')
@Grab('org.ocpsoft.prettytime:prettytime-nlp:4.0.1.Final')

import org.ocpsoft.prettytime.nlp.PrettyTimeParser

def list = [
    'ABC High School March 5, 2016',
    'XYZ High School 08/20/2016 Gym'
]

def parser = new PrettyTimeParser()

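// Pair each line with the first DateGroup found in it (assumes one date per line).
list.collect {
    [rawText: it, dateGroup: parser.parseSyntax(it).head()]
}.collect {
    // Index range of the text before the matched date.
    def before = 0..<it.dateGroup.position
    // Index range of the text after the matched date.
    def after = (it.dateGroup.position + it.dateGroup.text.size())..<it.rawText.size()

    [
        before: it.rawText[before].trim(),
        date: it.dateGroup.dates.head(),    // the parsed java.util.Date
        dateString: it.dateGroup.text,      // the text that was parsed into the date
        after: it.rawText[after].trim()
    ]
}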

NOTE: Don't use the @Grab annotations in Grails; you should already have the dependencies set up.

How it works

The example above uses the entire original text along with the position in which Pretty Time found the date, and the text which was parsed into a date, to create two ranges: one for the text before the date, and another for the text after the date. These two ranges are then used against the entire original text to extract the three components. OK... four, I added the Date. The output looks like this:

[
    [
        before:ABC High School, 
        date:Sat Mar 05 11:45:56 EST 2016, 
        dateString:March 5, 2016, 
        after:
    ], 
    [
       before:XYZ High School, 
       date:Sat Aug 20 11:45:56 EDT 2016, 
       dateString:08/20/2016, 
       after:Gym
    ]
]
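
For anyone not on Groovy, the same range arithmetic can be sketched in plain Java without touching the library. This is only an illustration: `DateSplit`, `Parts`, and `split` are hypothetical names, not PrettyTime API, and the sketch assumes the position is the zero-based index where the matched text starts, which is how the ranges above treat the DateGroup's position. In real use the position and matched text would come from the DateGroup; here `indexOf` stands in for them.

```java
// Sketch only: DateSplit, Parts, and split are hypothetical names, not PrettyTime API.
final class DateSplit {

    /** The three pieces of a line: text before the date, the date text, text after it. */
    static final class Parts {
        final String before;
        final String dateString;
        final String after;

        Parts(String before, String dateString, String after) {
            this.before = before;
            this.dateString = dateString;
            this.after = after;
        }
    }

    /**
     * Splits rawText around a match.
     * Assumption: position is the zero-based index at which matchedText starts.
     */
    static Parts split(String rawText, int position, String matchedText) {
        int end = position + matchedText.length();
        if (position < 0 || end > rawText.length()) {
            throw new IllegalArgumentException("match lies outside the text");
        }
        // Everything before the match, and everything after it, with padding removed.
        String before = rawText.substring(0, position).trim();
        String after = rawText.substring(end).trim();
        return new Parts(before, matchedText, after);
    }

    public static void main(String[] args) {
        String line = "XYZ High School 08/20/2016 Gym";
        String matched = "08/20/2016";
        // indexOf stands in for the position the parser would report.
        Parts p = split(line, line.indexOf(matched), matched);
        System.out.println(p.before + " | " + p.dateString + " | " + p.after);
    }
}
```

Like the Groovy version, this handles one match per call; a line with several dates would need one call per match, working through the text piece by piece.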
Emmanuel Rosa
  • Thank you very much, man. This was exactly what I was looking for. I searched for `DateGroup` after your answer and I did read about it but didn't have any clues. Now it makes sense and there are other methods that are interesting. Thanks again. – monty_bean Mar 18 '16 at 17:11
  • Hello~ I was trying to understand your code and had a couple questions. Why did you use `parseSyntax` and `.head()` instead of just `parse()`? I understood the rest of the code but those two. Also what is the difference between `List` and `List`? I can't seem to find a clear answer. Thanks. – monty_bean Mar 21 '16 at 17:42
  • I chose `parseSyntax()` over `parse()` because `parse()` returns a list of `java.util.Date`s, which don't contain any info about the text that was parsed to produce them. `parseSyntax()`, on the other hand, returns a list of Pretty Time `DateGroup`s. `head()` returns the first item in a collection, and is used to grab the first `DateGroup`. So my example expects there to be only one date per line. – Emmanuel Rosa Mar 21 '16 at 17:48
  • Many thanks!! I really appreciate your time and talent. – monty_bean Mar 21 '16 at 18:29