Using regex in spaCy: matching various (different cased) words

Question

Edit due to off-topic

I want to use regex in SpaCy to find any combination of (Accrued or accrued or Annual or annual) leave by this code:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
matcher.add('LEAVE', None, 
            [{'TEXT': {"REGEX": "(Accrued|accrued|Annual|annual)"}}, 
             {'LOWER': 'leave'}])

# Call the matcher on the doc
doc= nlp('Annual leave shall be paid at the time . An employee is  to receive their annual leave payment in the normal pay cycle. Where an employee has accrued annual leave in')

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print('- ', matched_span.sent.text)

# returned:
- Annual leave shall be paid at the time .
- An employee is  to receive their annual leave payment in the normal pay cycle.
- Where an employee has accrued annual leave in

However, I think my regex is not general enough to be applied to other situations. I would very much appreciate your advice on how to improve my regex with spaCy.

Why add a regex matcher? When you tokenize you get the lowercased form, so you can write a list lookup function. — Tiago Duque, Aug 20 '19 at 12:34
Thanks for your advice, @TiagoDuque. The reason I used the regex was I wanted to be more succinct (instead of creating patterns: `[{'LOWER': 'annual'}, {'LOWER': 'leave'}]` and `[{'LOWER': 'accrued'}, {'LOWER': 'leave'}]`) , Could you please elaborate on what you meant by creating a list lookup function? Would you mind showing me how to do it so that I can retrieve all three sentences? — Nemo, Aug 20 '19 at 12:49
I've been checking on your idea and I've found an even better solution. I'll post it below. — Tiago Duque, Aug 20 '19 at 13:05
I think it works as expected, you just have a typo in `ananual`. `"(Accrued|accrued|Annual|ananual)"` -> `"(Accrued|accrued|Annual|annual)"`. Your code yields all 3 sentences then. — Wiktor Stribiżew, Aug 20 '19 at 13:53
But you really do not have to repeat the differently cased words, with regex, it is just ``"(?i)accrued|annual"``. To match whole words, add word boundaries, `r"(?i)\b(?:accrued|annual)\b"` — Wiktor Stribiżew, Aug 20 '19 at 14:00
@WiktorStribiżew, thank you for your sharp eyes (pointing out my typo) and great regex expression. Would you please post your reply as answer so that I could accept it? — Nemo, Aug 21 '19 at 01:35
@WiktorStribiżew, would you mind also please explaining why you used `(?:accrued|annual)` instead of `(?P<name>accrued|annual)`? I read that `(?:A)` matches the expression represented by A but, unlike `(?P<name>AB)`, which matches the expression AB, it cannot be retrieved afterwards. — Nemo, Aug 21 '19 at 03:47

score 2 · Accepted Answer · answered Aug 21 '19 at 08:53

Your code is fine; you just have a typo in ananual. With that fixed, it yields all 3 sentences.

However, you do not need to repeat the differently cased words. With Python re regex, you may put the (?i) inline modifier at the start of the pattern to make the whole pattern case-insensitive.

You may use

"(?i)accrued|annual"
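As a quick check outside spaCy, the behaviour of the inline flag can be sketched with the standard `re` module alone. The word list below is made up for illustration, and the assumption is that the pattern gets applied with `re.search` semantics:

```python
import re

# (?i) at the very start makes the whole pattern case-insensitive.
pattern = re.compile(r"(?i)accrued|annual")

# Illustrative inputs, not taken from the question.
words = ["Accrued", "accrued", "Annual", "annual", "ANNUAL", "leave"]

# Keep every word in which the pattern is found.
matched = [w for w in words if pattern.search(w)]
print(matched)
```

Every casing of the two words is kept, while `leave` is dropped.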

Or, to match whole words, add word boundaries \b:

r"(?i)\b(?:accrued|annual)\b"

Note the r prefix before the opening ", which makes the string literal raw, so you do not have to escape \ in it: r"\b" == "\\b".
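The difference the r prefix makes can be seen with a few lines of plain Python. This is only a sketch; the variable names are illustrative:

```python
import re

# The raw literal and the escaped literal are the same two-character string.
same_string = r"\b" == "\\b"

raw_len = len(r"\b")    # backslash followed by 'b'
plain_len = len("\b")   # a single backspace character (\x08)

# Only the raw (or escaped) form reaches the regex engine as a word boundary.
with_raw = re.search(r"\bannual\b", "annual leave") is not None

# Without the prefix the pattern asks for literal backspace characters.
without_raw = re.search("\bannual\b", "annual leave") is not None

print(same_string, raw_len, plain_len, with_raw, without_raw)
```

Without the prefix, the pattern looks for backspace characters around `annual`, so it does not match ordinary text.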

The (?:...) non-capturing group is there to make sure \b word boundaries get applied to all the alternatives inside the group. \baccrued|annual\b will match accruednesssss or biannual, for example (it will match words that start with accrued or those ending with annual).
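The effect of the group on the word boundaries can be verified with `re` directly. This is a minimal sketch with made-up sample words, assuming `re.search` semantics:

```python
import re

# Boundaries apply to both alternatives because of the group.
grouped = re.compile(r"(?i)\b(?:accrued|annual)\b")

# Without the group this reads as: (\baccrued) OR (annual\b).
ungrouped = re.compile(r"(?i)\baccrued|annual\b")

# Illustrative inputs.
samples = ["annual", "Accrued", "accruednesssss", "biannual", "annually"]

grouped_hits = [s for s in samples if grouped.search(s)]
ungrouped_hits = [s for s in samples if ungrouped.search(s)]

# A non-capturing group creates no retrievable group; the whole match
# is still available through group(0).
match = grouped.search("annual")
captured = match.groups()
whole = match.group(0)

print(grouped_hits)
print(ungrouped_hits)
```

The grouped pattern keeps only the two whole words, while the ungrouped one also accepts `accruednesssss` and `biannual`. Since only the full match is needed here, a capturing or named group would add nothing.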

Wiktor Stribiżew

Thanks for your explanation of the prefix `r`. However, an example in spaCy documentation (https://spacy.io/usage/rule-based-matching) didn't use it: `pattern = [{"TEXT": {"REGEX": "^[Uu](\.?|nited)$"}}, {"TEXT": {"REGEX": "^[Ss](\.?|tates)$"}}, {"LOWER": "president"}]` Was that a typo (of missing the `r`)? – Nemo Aug 21 '19 at 11:34
@Nemo This is not about Spacy, this is a pure Python thing. Please study [string literals](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals). Also, regarding regex, check [Python regex - r prefix](https://stackoverflow.com/questions/2241600/python-regex-r-prefix). And there are [more](https://stackoverflow.com/questions/26318287/what-does-r-mean-before-a-regex-pattern) than [that](https://stackoverflow.com/questions/21104476/what-does-the-r-in-pythons-re-compiler-pattern-flags-mean), [sure](https://stackoverflow.com/questions/2081640). – Wiktor Stribiżew Aug 21 '19 at 11:37

score 0 · Answer 2 · answered Aug 20 '19 at 13:34

In many NLP libraries, tokenization gives you a lowercased form of each token, making it unnecessary to write a regex for each casing of a word. spaCy does this through the LOWER token attribute.

However, Spacy matcher works better if you make use of the linguistic features that it is packaged with.

Let's start by creating a matcher based on linguistic features: you want to detect any type of leave (annual and I guess in the future you might consider monthly, weekly, etc) - these are all adjectives. So you could define a pattern that includes the "leave" word preceded by an adjective, like so:

pattern = [{'POS': 'ADJ'},
           {'LEMMA': 'leave'}]

In the above snippet, POS stands for Part of Speech and receives the value ADJ (for adjective). LEMMA stands for the word's base form. You can check this online example. Notice, however, that "accrued" is being recognized as a verb, and not an adjective (in fact, this polysemy problem is there for any NLP library). You could also add another pattern just for "accrued leave", using two LEMMA values.

Just add the matcher and you're good to go:

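matcher = Matcher(nlp.vocab)
# The first argument names the rule (spaCy v2 call, as in the question;
# in spaCy v3 this would be matcher.add('ADJ_LEAVE', [pattern]))
matcher.add('ADJ_LEAVE', None, pattern)
matches = matcher(doc)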

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print('- ', matched_span.sent.text)

Any adjective *immediately* before leave. This is more general. — Tiago Duque, Aug 20 '19 at 13:46
Yeah, but not just `accrued` or `annual`, which is the question about. — Wiktor Stribiżew, Aug 20 '19 at 13:50
Sure, it includes "monthly", "weekly" or anything as commented. If he just wants accrued or annual, just create two patterns and change the first matcher attribute to 'LEMMA': 'accrued'; 'LEMMA': 'annual' — Tiago Duque, Aug 20 '19 at 13:53
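The list lookup idea from the comments, restricted to accrued and annual, might look roughly like this in plain Python. It is only a sketch: `find_leave_phrases` and `LEAVE_MODIFIERS` are hypothetical names, and whitespace splitting stands in for spaCy's tokenizer.

```python
# Hypothetical names: neither LEAVE_MODIFIERS nor find_leave_phrases
# is part of spaCy.
LEAVE_MODIFIERS = {"accrued", "annual"}

def find_leave_phrases(text, modifiers=LEAVE_MODIFIERS):
    """Return (modifier, 'leave') pairs found in adjacent tokens."""
    # Crude tokenization: split on whitespace, strip trailing punctuation.
    tokens = [t.strip(".,;:!?") for t in text.split()]
    pairs = []
    for first, second in zip(tokens, tokens[1:]):
        # Compare lowercased forms, so no per-casing alternatives are needed.
        if first.lower() in modifiers and second.lower() == "leave":
            pairs.append((first, second))
    return pairs

text = ("Annual leave shall be paid at the time . "
        "Where an employee has accrued annual leave in")
phrases = find_leave_phrases(text)
print(phrases)
```

Like the two-token matcher patterns, this only looks at the word immediately before leave, so in "accrued annual leave" it reports annual leave and not accrued.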
