I have a dataset (df_test) containing of several news articles (Text_4). Using SpaCy, I've extracted the 'DATE' entities. For those I want to see whether they are in the future or in the past (to identify news articles that reference future events such as product launches) compared to the article's publication date (RP_DateFormatted)
My current code is
for index, row in df_test.iterrows():
doc = nlp(row.Text_4)
entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}
... some other steps ... then:
ListDATE3 = [dateparser.parse(replace_all((i.text), od), languages=['en'],
settings={'RELATIVE_BASE': datetime.strptime(row.RP_DateFormatted, '%Y-%m-%d'),
'PREFER_DAY_OF_MONTH': 'last',
'PREFER_DATES_FROM': 'future'}) for i in entities['DATE']]
df_test.PY_Entities_DatesParsed[index] = ListDATE3
I have trouble with the line 'PREFER_DATES_FROM': 'future'
, for example:
Article was written on August 15th 2005 but no year is given in the text. SpaCy extracts "Aug 15" as Date. The dateparser sets the year to 2006 (because it is in the future). Consequently, I would then believe that the news article talks about the future - which it does not.
Setting 'PREFER_DATES_FROM': 'past'
would also not help me in a case when an event is described that happens in February (without a year given in the text). This is likely to be next February but the dateparser would set it to this year's February.
Is there a way to add an if statement to the settings or to create a new function based on the dateparser? Please note that each news articles can have multiple dates (entities['DATE'] is a list for each row in my dataframe).
I am using Python 3.8