I'm working on a sensitive data recognition (NER) task, and I can't accurately detect dates in text. I've tried almost everything...
For example, my text contains dates like these:
date_list = ['23 octbr', '08/10/1975', '2/20/1961', 'December 23', '2021', '1/10/1980', ...]
The text also contains a lot of other numerical information, for example IP addresses, house addresses, bank card numbers, etc.
Here is an example of how spaCy handles some of these dates:
'08/10/1975' -> Entity type: No Entity
'2/20/1961' -> Entity type: DATE
'1/10/1980' -> Entity type: CARDINAL
Or, for example, given the phone number "(150) 224-2215", spaCy marks the part "24-2215" as a DATE. This often happens with addresses and credit card numbers too.
Then I tried datefinder and dateparser.search, but they detected completely incorrect parts of the sentence, or parts that contained the word "to".
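For comparison, a stricter rule-based baseline for the purely numeric formats might look like the sketch below. It is only an illustration, not part of any of the libraries above: the function name find_numeric_dates and the pattern are hypothetical, it assumes slash-separated dates with a four-digit year, and it does not cover textual forms such as '23 octbr' or 'December 23'. The idea is to accept a candidate only if it is not glued to other digits and also parses as a real calendar date.

```python
import re
from datetime import datetime

# Hypothetical pattern: D/M/YYYY or M/D/YYYY that is not directly attached
# to another digit or slash (so fragments of longer numbers are skipped).
NUMERIC_DATE = re.compile(r"(?<![\d/])\d{1,2}/\d{1,2}/\d{4}(?![\d/])")

# Formats to try; the day/month order is ambiguous, so both are accepted.
FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def find_numeric_dates(text):
    """Return substrings that match the pattern and parse as valid dates."""
    found = []
    for match in NUMERIC_DATE.finditer(text):
        candidate = match.group(0)
        for fmt in FORMATS:
            try:
                datetime.strptime(candidate, fmt)
            except ValueError:
                continue  # not valid under this format, try the next one
            found.append(candidate)
            break
    return found


sample = (
    "Born 08/10/1975, moved 2/20/1961. "
    "Call (150) 224-2215 or ping 192.168.1.10 on 1/10/1980."
)
print(find_numeric_dates(sample))
```

Something like this would presumably avoid the phone number and IP address false positives, but it obviously only handles one family of formats.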
Can you please share your experience? What could work better, and what is the best way to get high accuracy in date detection?