How can I use spaCy to detect a pattern like this?

mygov-24.mygov.ca - last updated: 06/05/21

I want to detect the following patterns:

  • mygov-24.mygov.ca - last updated: 06/05/21
  • mygov-24.mygov.ca - last updated: 02/04/21
  • mygov-24.mygov.ca - last updated: 01/02/21
  • ....

As you can see, the date changes but everything else stays the same. How can I use spaCy to create a pattern matcher that tells me whether an input string follows this pattern? Also, if the pattern is detected, I want to extract the date. Is that possible with spaCy?

I went through https://spacy.io/usage/rule-based-matching but am not sure where to start.

EDIT: Given a group of dynamic phrases like the ones above, is there a way to identify the variables within the phrases?

Amanda

1 Answer

You can detect these strings with spaCy's Matcher, using code like this:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# One dict per token: five fixed tokens, then a token that looks like a date
pattern = [{'ORTH': 'mygov-24.mygov.ca'}, {'ORTH': '-'}, {'ORTH': 'last'}, {'ORTH': 'updated'}, {'ORTH': ':'},
           {'ORTH': {'REGEX': r'^\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?$'}}]
matcher.add("last_updated", [pattern])

text = r'It was here, mygov-24.mygov.ca - last updated: 06/05/21. Next: mygov-24.mygov.ca - last updated: 02/04/21. And one more: mygov-24.mygov.ca - last updated: 01/02/21'
doc = nlp(text)

# as_spans=True (spaCy v3+) returns Span objects instead of (match_id, start, end) tuples
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text)

Output:

mygov-24.mygov.ca - last updated: 06/05/21
mygov-24.mygov.ca - last updated: 02/04/21
mygov-24.mygov.ca - last updated: 01/02/21
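
To extract only the date, one option is to take the last token of each matched span, since the date is the final item in the pattern. The following is a minimal sketch of that idea, not a definitive implementation. The `extract_dates` helper is a hypothetical name introduced for illustration, and `spacy.blank("en")` is an assumption made to avoid downloading a model: the Matcher here only depends on the tokenizer, which should split the text the same way as `en_core_web_sm`.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is sufficient,
# because the pattern uses nothing but token text (ORTH).
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"ORTH": "mygov-24.mygov.ca"}, {"ORTH": "-"}, {"ORTH": "last"},
    {"ORTH": "updated"}, {"ORTH": ":"},
    {"ORTH": {"REGEX": r"^\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?$"}},
]
matcher.add("last_updated", [pattern])


def extract_dates(text):
    """Return the date token of every match in text (hypothetical helper)."""
    doc = nlp(text)
    # The date is the last token of the pattern, so span[-1] is the date token.
    return [span[-1].text for span in matcher(doc, as_spans=True)]


dates = extract_dates(
    "It was here, mygov-24.mygov.ca - last updated: 06/05/21. "
    "Next: mygov-24.mygov.ca - last updated: 02/04/21."
)
print(dates)
```

An empty list means the input did not contain the pattern, so the same helper answers both "does it match?" and "what is the date?".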

The ^\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?$ regex matches a whole token that consists of:

  • ^ - start of the token
  • \d{1,2} - one or two digits
  • / - a / character
  • \d{1,2}/ - one or two digits and another /
  • \d{2} - two digits
  • (?:\d{2})? - optionally, two more digits (so both 21 and 2021 are accepted)
  • $ - end of the token.
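
The regex can be checked on its own with Python's standard `re` module, outside spaCy. This is a small sketch: the sample strings and the `DATE_TOKEN` name are made up for illustration. It shows which token shapes the pattern accepts and rejects.

```python
import re

# Same pattern as in the Matcher, compiled for standalone testing.
DATE_TOKEN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?$")

# Illustrative samples (not from the original question).
samples = ["06/05/21", "6/5/2021", "06/05/2", "06/05/211", "06-05-21"]

# Map each sample to True/False depending on whether the whole string matches.
results = {s: bool(DATE_TOKEN.match(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

Note that the pattern only checks the shape of the token: it does not validate that the numbers form a real date, and it does not say whether the format is day/month or month/day.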
Wiktor Stribiżew
  • Is it possible to detect dynamic variables given X number of phrases? I need to detect variables in the phrases. Manually defining a regex will not be optimal – Amanda May 07 '21 at 17:46
  • @Amanda You need to explain that in the question, I have no idea what you mean. – Wiktor Stribiżew May 07 '21 at 17:52
  • Updated. Given a group of dynamic phrases as above, is there a way to identify the variables within the phrases? A regex-based solution would be available without using Spacy – Amanda May 08 '21 at 04:24
  • @Amanda Do you think Spacy will guess any patterns for you? This is not possible. You still need to specify them yourself. – Wiktor Stribiżew May 08 '21 at 10:19
  • Given n number of sentences, is there a way to identify the pattern and detect the variable? I am looking for something similar to the flow here https://youtu.be/B34gHahWX_s – Amanda May 08 '21 at 11:23
  • @Amanda There is no way to do it "easily". There are special [tools](https://www.regexmagic.com/) for that. Spacy is not meant to be used like that anyway. Also, see [How to auto generate regex from given list of strings?](https://stackoverflow.com/questions/4880402/how-to-auto-generate-regex-from-given-list-of-strings). – Wiktor Stribiżew May 08 '21 at 11:33
  • @Amanda [This Github repo](https://github.com/iuliux/RegExTractor) contains code that can convert a list of strings into a regex that can work the way you need. For Python3, you will need to fix `print` statements. – Wiktor Stribiżew May 08 '21 at 12:49
  • Did you see the video I shared? – Amanda May 08 '21 at 14:26
  • The repo that you shared fails to create a generic regex for date / numbers. – Amanda May 08 '21 at 14:40