I have a dataset (df_test) containing several news articles (Text_4). Using spaCy, I've extracted the 'DATE' entities. For each of them, I want to determine whether it lies in the future or in the past relative to the article's publication date (RP_DateFormatted), in order to identify news articles that reference future events such as product launches.

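My current code is:

from datetime import datetime
from itertools import groupby
import dateparser

for index, row in df_test.iterrows():
    doc = nlp(row.Text_4)
    # Group the recognised entities by label, e.g. entities['DATE']
    entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}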

... some other steps ... then:

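        # Parse each DATE entity relative to the article's publication date
        base = datetime.strptime(row.RP_DateFormatted, '%Y-%m-%d')
        ListDATE3 = [dateparser.parse(replace_all(i.text, od), languages=['en'],
                                      settings={'RELATIVE_BASE': base,
                                                'PREFER_DAY_OF_MONTH': 'last',
                                                'PREFER_DATES_FROM': 'future'})
                     for i in entities['DATE']]
        # .at avoids chained assignment (assumes an object-dtype column)
        df_test.at[index, 'PY_Entities_DatesParsed'] = ListDATE3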

I have trouble with the setting 'PREFER_DATES_FROM': 'future'. For example, an article was written on August 15th, 2005, but the text gives no year. spaCy extracts "Aug 15" as a DATE entity, and dateparser sets the year to 2006 (because that date is in the future). Consequently, I would conclude that the news article talks about the future, which it does not.

Setting 'PREFER_DATES_FROM': 'past' would not help either when the text describes an event that happens in February (without giving a year). This is likely to be next February, but dateparser would resolve it to this year's February.

Is there a way to add an if statement to the settings, or to create a new function based on dateparser? Please note that each news article can have multiple dates (entities['DATE'] is a list for each row in my dataframe).

I am using Python 3.8.

AlexanderP

1 Answer

-1

I don't think you'll be able to solve this with dateparser options alone. dateparser interprets a date string mechanically, but telling whether a date is in the past or future relies on knowledge of the surrounding words and context of the article ("at next February's festival...").

This is a pretty hard thing to get right in an automated system. In NLP research this is referred to as "grounding", and includes related problems, like telling who "President of the United States" refers to (what year was it?), or what color "red" is (is it red like a stop sign, or red like red hair?).

What I would do is start with rule-based techniques to decide whether a date is in the past or future before passing it to dateparser. Take some words from around each date entity: if "last" is there, it's in the past; if "next" is there, it's in the future; that sort of thing. See how well that does. (You might think you could just look at the words before the date entity, but you can also have "February last year was really cold" or something.)
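A minimal sketch of that rule-based check, to make the idea concrete. It works on a plain list of tokens so it runs without a spaCy model; the cue word sets, the function name guess_direction and the window size are all assumptions for illustration, not anything spaCy or dateparser provides. The window covers both sides of the date so that "February last year" is caught as well as "next February".

```python
# Hypothetical cue lists; extend them after inspecting your own articles.
PAST_CUES = {"last", "previous", "ago", "earlier"}
FUTURE_CUES = {"next", "upcoming", "coming"}


def guess_direction(tokens, start, end, window=3):
    """Guess whether the date span tokens[start:end] points to the past or future.

    Looks at the span itself plus `window` tokens on each side.
    Returns "past", "future", or None when there is no cue or the cues conflict.
    """
    lo = max(0, start - window)
    hi = min(len(tokens), end + window)
    context = {t.lower() for t in tokens[lo:hi]}
    has_past = bool(context & PAST_CUES)
    has_future = bool(context & FUTURE_CUES)
    if has_past == has_future:  # no cue at all, or contradictory cues
        return None
    return "past" if has_past else "future"


# With spaCy, tokens would be [t.text for t in doc] and start/end would be
# ent.start / ent.end for each DATE entity.
tokens = "The festival returns next February with a new lineup".split()
print(guess_direction(tokens, 4, 5))  # future
```

When the function returns a direction, you could pass it as 'PREFER_DATES_FROM'; when it returns None, you need some fallback, since the text itself gives no hint.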

If you want to try a statistical system after that, you could look at using the spancat component in spaCy with different kinds of context windows to classify dates as "future" or "past".

polm23
  • I didn't downvote, unsure who did and why. If there is a clear indication such as "next April", I believe SpaCy already indicates that. My issue is primarily around dates without much context - for example the publishing date that is inside the text and I did not find a good pattern to filter it out yet. – AlexanderP Nov 10 '21 at 13:00
  • Ah, that comment wasn't addressed to you, just to whoever it was who did it, no worries. If there's no context in the document it'll be hard to automate, but you could use a heuristic, something like evaluating for past and future and seeing which date is closer. – polm23 Nov 10 '21 at 14:59
  • Great idea! How would I do this? – AlexanderP Nov 11 '21 at 16:05
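
One way the "evaluate for past and future and see which is closer" heuristic could look, as a rough sketch rather than a tested solution. The helper names pick_closest and parse_closest are made up for illustration; the dateparser call uses the same 'RELATIVE_BASE' and 'PREFER_DATES_FROM' settings as the question. It assumes dateparser returns naive datetimes (no time zone in the text), otherwise the subtraction against the publication date would fail.

```python
from datetime import datetime


def pick_closest(candidates, base):
    """Return the candidate datetime nearest to `base`, ignoring failed parses (None)."""
    valid = [c for c in candidates if c is not None]
    if not valid:
        return None
    return min(valid, key=lambda c: abs(c - base))


def parse_closest(text, base):
    """Parse `text` once preferring the past and once the future; keep the nearer result."""
    import dateparser  # third-party; imported here so pick_closest works without it

    candidates = [
        dateparser.parse(
            text,
            languages=["en"],
            settings={"RELATIVE_BASE": base, "PREFER_DATES_FROM": prefer},
        )
        for prefer in ("past", "future")
    ]
    return pick_closest(candidates, base)


# Hand-written candidates for "Aug 15" in an article dated 2005-08-15.
base = datetime(2005, 8, 15)
print(pick_closest([datetime(2005, 8, 15), datetime(2006, 8, 15)], base))
# 2005-08-15 00:00:00
```

In the loop from the question, parse_closest(replace_all(i.text, od), base) would replace the single dateparser.parse call, with base being the parsed RP_DateFormatted. On an exact tie the past candidate wins, because it comes first in the list.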