0

I'm working on a sensitive data recognition (NER) task and cannot accurately detect dates in text. I've tried almost everything...

For example, I have these kinds of dates in my text:

date_list = ['23 octbr', '08/10/1975', '2/20/1961', 'December 23', '2021', '1/10/1980', ...]

I should add that the text also contains a lot of other numerical information, for example IP addresses, house addresses, bank card numbers, etc.

This is an example of how spaCy handles it:

'08/10/1975' -> Entity type: No Entity
'2/20/1961' -> Entity type: DATE
'1/10/1980' -> Entity type: CARDINAL

Or, for example, I have the phone number "(150) 224-2215" and spaCy marks the part "24-2215" as a date. This often happens with addresses and credit card numbers too.

Then I tried datefinder and dateparser.search, but they detected completely incorrect parts of the sentence, or parts that contained the word "to".

Can you please share your experience, what could work better? What is the best way to get high accuracy of date detection?

martineau
hidden layer
  • What is your definition of the "best way"? – martineau Oct 28 '21 at 17:24
  • @martineau, the one with the best accuracy. Maybe someone had the same task at work and has experience to share. – hidden layer Oct 28 '21 at 17:26
  • In that case, you may have to do it multiple ways and then pick from the results. Having some way of programmatically being able to reject false positives would be very helpful. – martineau Oct 28 '21 at 17:37
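A minimal sketch of the idea in the last comment, assuming two toy regex-based detectors (`slash_detector` and `loose_detector` are hypothetical stand-ins for spaCy, dateparser, or any other detector that returns character spans). A span is kept only if enough detectors agree on it and `datetime.strptime` accepts it in one of the assumed formats:

```python
import re
from collections import Counter
from datetime import datetime

def slash_detector(text):
    # Hypothetical detector 1: strict numeric shapes like 2/20/1961.
    return [m.span() for m in re.finditer(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text)]

def loose_detector(text):
    # Hypothetical detector 2: any digit groups joined by "/" or "-".
    return [m.span() for m in re.finditer(r"\d+(?:[/-]\d+)+", text)]

def is_valid_date(candidate, formats=("%m/%d/%Y", "%d/%m/%Y")):
    # Reject false positives: the string must parse as a real calendar date.
    # The format list is an assumption; extend it for your data.
    for fmt in formats:
        try:
            datetime.strptime(candidate, fmt)
            return True
        except ValueError:
            continue
    return False

def vote(text, detectors, min_votes=2):
    # Count how many detectors proposed each exact (start, end) span.
    votes = Counter(span for detect in detectors for span in set(detect(text)))
    return [
        text[start:end]
        for (start, end), n in sorted(votes.items())
        if n >= min_votes and is_valid_date(text[start:end])
    ]

text = "Born 2/20/1961, phone (150) 224-2215, invoice 99/99/2021"
print(vote(text, [slash_detector, loose_detector]))
```

Here the phone fragment gets only one vote and "99/99/2021" fails validation, so only "2/20/1961" survives. Exact-span matching is a simplification; real detectors often disagree on boundaries and would need overlap-based matching.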

2 Answers

2

What does your corpus include? Does it contain full sentences?

  • First of all you can try spaCy NER with context. NER algorithms work on full sentences.

  • If you are looking for a more token/shape-oriented solution, I suggest context-free parsing. A context-free grammar is great for describing dates. Basically, you define some grammar rules such as:

calendar_year -> full_year | year
year -> 19\d{2} | 20\d{2}
full_year -> day/month/year | day.month.year
day -> digit_num | two_digit_num
month -> digit_num | two_digit_num
two_digit_num -> digit_num digit_num
digit_num -> 0 | 1 | 2 | ... | 9

Regex is not a good idea here because it has no "context": the characters being matched are not aware of what has been parsed before, so there is no memory. Context-free grammars offer a structured way to parse strings, and they give you parse trees as well.

This is how I did it with Lark (the dates are in German): https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/
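To make the grammar idea concrete, here is a small hand-written parser for the day/month/year rule. It is a sketch in plain Python rather than Lark, so it has no dependencies; the function name `parse_date` is hypothetical, and it assumes four-digit years starting with 19 or 20 and requires both separators to be the same character. Like the grammar, it does not check value ranges such as day <= 31.

```python
DIGITS = "0123456789"

def parse_date(s):
    """Parse 'day SEP month SEP year' and return a small parse tree, or None."""
    pos = 0

    def digits(min_n, max_n):
        # Consume between min_n and max_n ASCII digits starting at pos.
        nonlocal pos
        start = pos
        while pos < len(s) and pos - start < max_n and s[pos] in DIGITS:
            pos += 1
        if pos - start < min_n:
            return None
        return s[start:pos]

    day = digits(1, 2)
    if day is None or pos >= len(s) or s[pos] not in "/.":
        return None
    sep = s[pos]          # remember the separator that was used
    pos += 1

    month = digits(1, 2)
    if month is None or pos >= len(s) or s[pos] != sep:
        return None       # the second separator must match the first
    pos += 1

    year = digits(4, 4)
    if year is None or pos != len(s) or year[:2] not in ("19", "20"):
        return None
    return {"day": int(day), "month": int(month), "year": int(year)}

print(parse_date("25.10.1991"))
print(parse_date("24-2215"))
```

The first call returns a tree with day 25, month 10 and year 1991; the phone-number fragment returns None because "-" is not an accepted separator. A grammar library such as Lark gives you the same structure declaratively once the rule set grows.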

Duygu
  • Sometimes the dates are in full sentences, but mostly they aren't. For example, I have a table with employee data where the column name is "date of birth", followed by [25.10.1991, ...]. But when I parse it in Python, it comes without context. – hidden layer Oct 28 '21 at 20:12
  • And thanks for the article, I will look through it. – hidden layer Oct 28 '21 at 20:15
  • OK, quick but stupid question:) Name of the column is "date of birth" so everything has to be a date. If you parse a CSV/excel with pandas, you can see the column names as well. Column names should give some clues definitely. – Duygu Oct 28 '21 at 20:26
  • I have this table in .docx file :( – hidden layer Oct 28 '21 at 20:42
  • How about something like this: https://medium.com/@karthikeyan.eaganathan/read-tables-from-docx-file-to-pandas-dataframes-f7e409401370 – Duygu Oct 28 '21 at 21:12
  • There are definitely several ways to combine pandas and docx. Before going for an NLP solution, I suggest googling harder :) – Duygu Oct 28 '21 at 21:13
  • Thanks for the ideas! – hidden layer Oct 28 '21 at 21:44
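The column-name clue from the comments could look roughly like this. The sketch assumes the table has already been extracted from the .docx file into a list of rows (for example with a library such as python-docx, not shown here); the helper names, the keyword list and the `%d.%m.%Y` format are all assumptions for illustration.

```python
from datetime import datetime

def date_columns(rows, keywords=("date", "birth", "dob")):
    # Use the header row to decide which columns are expected to hold dates.
    header = rows[0]
    return [
        i for i, name in enumerate(header)
        if any(k in name.lower() for k in keywords)
    ]

def parse_column(rows, col, fmt="%d.%m.%Y"):
    # Parse every cell of a date column; cells that do not fit become None.
    parsed = []
    for row in rows[1:]:
        try:
            parsed.append(datetime.strptime(row[col].strip(), fmt).date())
        except ValueError:
            parsed.append(None)
    return parsed

rows = [
    ["name", "date of birth"],
    ["Ann", "25.10.1991"],
    ["Bob", "n/a"],
]
for col in date_columns(rows):
    print(rows[0][col], parse_column(rows, col))
```

Because the header says the column holds dates, every cell can be parsed with a strict format and anything that fails is flagged, instead of asking an NER model to guess without context.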
-1

Have you tried using regex? It handles most things like dates and phone numbers.

Here is a small example so you can see the idea:

Example

import re
from datetime import datetime

register = "The last payment was 2021-09-21"
# Look for an ISO-style YYYY-MM-DD date.
match = re.search(r'\d{4}-\d{2}-\d{2}', register)
if match:
    # strptime also rejects impossible dates such as 2021-13-40.
    parsed = datetime.strptime(match.group(), '%Y-%m-%d').date()
    print(parsed)

Output

2021-09-21
  • Yes, of course I tried it. Your answer demonstrates the simplest of all possible options. The problem is that even if my data uses a #### / ## / ## format now, a different format may appear later in other files, and my program will have to detect that too. – hidden layer Oct 28 '21 at 18:27
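One way to keep a regex approach open to new formats is a table of (pattern, format) pairs, so supporting another format means adding one row. This is only a sketch: the three formats, the name `find_dates`, and the month-first reading of slash dates are assumptions, not something the thread settles.

```python
import re
from datetime import datetime

# (regex, strptime format) pairs; add a row when a new format shows up.
# The lookarounds stop a match from starting or ending inside a longer number.
PATTERNS = [
    (r"(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)", "%Y-%m-%d"),
    (r"(?<!\d)\d{1,2}/\d{1,2}/\d{4}(?!\d)", "%m/%d/%Y"),   # assumes month first
    (r"(?<!\d)\d{1,2}\.\d{1,2}\.\d{4}(?!\d)", "%d.%m.%Y"),
]

def find_dates(text):
    found = []
    for pattern, fmt in PATTERNS:
        for m in re.finditer(pattern, text):
            try:
                # strptime filters out shapes that are not real dates.
                found.append((m.group(), datetime.strptime(m.group(), fmt).date()))
            except ValueError:
                continue
    return found

text = "Paid 2021-09-21; born 2/20/1961; call (150) 224-2215; bad 13/45/2020."
for raw, parsed in find_dates(text):
    print(raw, "->", parsed)
```

On this sample the phone number matches no pattern and "13/45/2020" is dropped by `strptime`, leaving the two real dates. Free-text forms such as "23 octbr" or "December 23" are outside what this table covers.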