Hi, I tried to use spaCy's Matcher to find measurement phrases in text such as:

1 cups 1 1/2 cups 1 1/2-inch

To achieve this, I created the matcher patterns below.

pattern1 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'POS': 'NOUN'}]
# number, optional second number, noun

pattern2 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'ORTH': '-', 'OP': '?'}, {'POS': 'NOUN'}]
# The second number is optional, to cover both '2 inch' and '2 1/2 inch'.
# It should also cover '2 1/2-inch', hence the optional {'ORTH': '-'} token.

However, when I run the matcher, it only returns matches for the simplest case, a single number followed by a noun, as shown below.

matcher.add('Measurepattern', None, pattern1)
matcher.add('Measurepattern', None, pattern2)

matches = matcher(test_token)

matches

# Each match is a (match_id, start, end) tuple of token offsets.
for match_id, start, end in matches:
    print(test_token[start:end])

# 2 teaspoons
# 1 teaspoon
# 1 cup

Why is that, and how do I fix it?

Thank you

BS100
  • This actually seems to work. I get expected matches. – Wiktor Stribiżew Mar 03 '21 at 20:31
  • @WiktorStribiżew I didn't get the matches like 1 1/2 pounds – BS100 Mar 03 '21 at 20:45
  • What is your Spacy version? – Wiktor Stribiżew Mar 03 '21 at 20:53
  • @WiktorStribiżew it says 2.3.2 – BS100 Mar 03 '21 at 20:55
  • Ok, in that version, `1 1/2-inch` is tokenized as `('1', 'NUM'), ('1/2-inch', 'NUM')`, there will be no match if you do not add a specific pattern. I still get `[1 cups, 1 1/2 cups]` as output with the `nlp = spacy.load("en_core_web_sm")`. – Wiktor Stribiżew Mar 03 '21 at 21:04
  • What if you add `pattern3=[{'POS':'NUM'},{"TEXT": {"REGEX":"^\d+(?:/\d+)?-\w+$"}}];`? – Wiktor Stribiżew Mar 03 '21 at 21:07
  • @WiktorStribiżew Do you mean you are using the same version and the same patterns as me and still get the right output? Or do you mean I have to add the pattern you wrote above, i.e. pattern3=[{'POS':'NUM'},{'POS':'NUM','OP':'?'},{"TEXT": {"REGEX":"^\d+(?:/\d+)?-\w+$"}}]? Which version are you using? Is it v3? I installed spaCy following their instructions but I still get 2.3.2, not v3. Can you let me know how I could force conda to update mine to v3? Thanks a lot – BS100 Mar 03 '21 at 21:09
  • I installed 2.3.2 version in a virtual environment, and [here is the code](https://ideone.com/r8lsvG) that yields `[1 cups, 1 1/2 cups, 1 1/2-inch]` for `doc = nlp("1 cups, 1 1/2 cups, 1 1/2-inch")` given `nlp = spacy.load("en_core_web_sm")`. I do not use conda. – Wiktor Stribiżew Mar 03 '21 at 21:11
  • @WiktorStribiżew That works! Thank you so much. If you don't mind, can you let me know how to remove the matches from the original doc? Sorry for asking so many questions; I really appreciate your help already. My original is test_token=nlp('2 teaspoons salt 1 teaspoon vanilla extract 1 cup mixed candied fruit (such as glacéed cherries and citron, orange, or lemon peel), diced 1 cup golden raisins Special equipment: 2 (8-inch) or 8 (3 1/2-inch) paper panettone molds* *Available at baking supply stores and www.sugarcraft.com.') – BS100 Mar 03 '21 at 21:18
  • Sorry, remove the spans from the `test_token` doc? Or the found texts from the input string? – Wiktor Stribiżew Mar 03 '21 at 21:22
  • @WiktorStribiżew I want a list of the words that remain after filtering out the matched phrases such as 1 cups, 1 1/2 cups, etc. The final output should look like ['salt', 'vanilla', 'extract', 'mixed', 'candied', 'fruit', ...], without any of the matches – BS100 Mar 03 '21 at 21:25
  • I am a bit unclear about the task, but you may use `re` to get you the strings. `import re`, then `spans = map(str, spacy.util.filter_spans([test_token[start:end] for _, start, end in matches]))`, then `print(re.sub(r"(?<!\S)(?:{})(?!\S)".format('|'.join(map(re.escape, sorted(spans, key=len, reverse=True)))), "", test_token.text).split())` – Wiktor Stribiżew Mar 03 '21 at 21:34
  • @WiktorStribiżew OMG!!!! Thank you sooooo much!!! – BS100 Mar 03 '21 at 22:05
  • Just occurred to me that word boundaries (rather than whitespace ones) should work better here, use `r"\b(?:{})\b"` instead of `r"(?<!\S)(?:{})(?!\S)"` – Wiktor Stribiżew Mar 03 '21 at 22:19
  • @WiktorStribiżew Thank you! If you don't mind, can I ask how to select all the words between parentheses? ( anyword1 word2 word3 word4 word5 word6) <- select everything, because I'm going to delete this part using the method you suggested. I believe it needs a regex, but mine doesn't work for this case. I added it like this: pattern5=[{"TEXT":{"REGEX":"r'\((.*)\)'"}}]. I'd really appreciate it. – BS100 Mar 03 '21 at 22:24
  • I think you should remove those parts before creating the document, see [this answer of mine](https://stackoverflow.com/a/40621332/3832970), use `text = re.sub(r'\s*\([^()]*\)', '', text)`. – Wiktor Stribiżew Mar 03 '21 at 22:41
  • @WiktorStribiżew Totally Agree with your suggestion. Thank you so much! – BS100 Mar 03 '21 at 23:20
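
The two cleanup steps suggested in the comments above (dropping parenthesized parts before building the doc, then deleting the matched phrases with a word-boundary regex) can be sketched with the standard `re` module alone. This is only a rough illustration: the helper names `strip_parentheticals` and `remove_phrases`, the shortened sample text, and the hard-coded `measures` list (standing in for the strings of the matched spans) are assumptions made for the example, not part of the original code.

```python
import re

def strip_parentheticals(text):
    # Remove "(...)" groups without nested parentheses, plus any whitespace before them.
    return re.sub(r"\s*\([^()]*\)", "", text)

def remove_phrases(text, phrases):
    # Nothing to remove: just split on whitespace.
    if not phrases:
        return text.split()
    # Try longer phrases first so a short phrase cannot pre-empt a longer one.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    pattern = r"\b(?:{})\b".format(alternation)
    return re.sub(pattern, "", text).split()

# Hypothetical, shortened input; in the thread the phrases come from the matcher.
text = ("2 teaspoons salt 1 teaspoon vanilla extract "
        "1 cup mixed candied fruit (such as citron), diced")
measures = ["2 teaspoons", "1 teaspoon", "1 cup"]

words = remove_phrases(strip_parentheticals(text), measures)
print(words)
```

Punctuation stays attached to the neighbouring word (here `fruit,`), so a further cleanup step would be needed if bare words are required.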

1 Answer


In spaCy 2.3.2, `1 1/2-inch` is tokenized as `('1', 'NUM'), ('1/2-inch', 'NUM')`. Neither token is tagged `NOUN` and the hyphen is not a separate token, so your current patterns cannot match it unless you introduce a new, specific pattern.

Here is an example: `pattern3 = [{'POS': 'NUM'}, {"TEXT": {"REGEX": r"^\d+(?:/\d+)?-\w+$"}}]`. The regex matches a token whose text starts with one or more digits, optionally followed by `/` and one or more digits, then a `-`, and then one or more word chars (letters, digits or `_`) up to the end of the token. You may replace `\w` with `[^\W\d_]` to match only letters.
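
To see what that regex accepts without loading a model, it can be tried on plain strings with Python's `re` module. This is a small sketch, not part of the original answer: the sample token texts are made up for illustration, and `re.search` is used here on the assumption that it behaves like the Matcher's `REGEX` check for an anchored pattern.

```python
import re

# Same regex as in pattern3, written as a raw string so the backslashes
# reach the regex engine unchanged.
FRACTION_UNIT = re.compile(r"^\d+(?:/\d+)?-\w+$")

# Letters-only variant: [^\W\d_] is a word char that is neither a digit nor "_".
FRACTION_UNIT_LETTERS = re.compile(r"^\d+(?:/\d+)?-[^\W\d_]+$")

# Hypothetical token texts, chosen for illustration only.
samples = ["1/2-inch", "8-inch", "1/2", "inch", "3-4", "2-x_1"]

matched = [s for s in samples if FRACTION_UNIT.search(s)]
matched_letters = [s for s in samples if FRACTION_UNIT_LETTERS.search(s)]

print(matched)          # tokens accepted by the \w version
print(matched_letters)  # tokens accepted by the letters-only version
```

The `\w` version also accepts tokens such as `3-4`, which is one reason to consider the letters-only variant for unit names.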

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# number, optional second number, noun: '1 cups', '1 1/2 cups'
pattern1 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {'POS': 'NOUN'}]
# same, with an optional separate '-' token before the noun
pattern2 = [{'POS': 'NUM'}, {'POS': 'NUM', 'OP': '?'}, {"ORTH": "-", 'OP': '?'}, {'POS': 'NOUN'}]
# number followed by a single token such as '1/2-inch' (raw string keeps the backslashes intact)
pattern3 = [{'POS': 'NUM'}, {"TEXT": {"REGEX": r"^\d+(?:/\d+)?-\w+$"}}]

# Register all three patterns under one match key.
matcher.add("HelloWorld", [pattern1, pattern2, pattern3])

doc = nlp("1 cups, 1 1/2 cups, 1 1/2-inch")
print([(t.text, t.pos_) for t in doc])
#[('1', 'NUM'), ('cups', 'NOUN'), (',', 'PUNCT'), ('1', 'NUM'), ('1/2', 'NUM'), ('cups', 'NOUN'), (',', 'PUNCT'), ('1', 'NUM'), ('1/2-inch', 'NUM')]

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
# filter_spans drops duplicate and overlapping spans, preferring the longest ones.
print(spacy.util.filter_spans(spans))
## => [1 cups, 1 1/2 cups, 1 1/2-inch]
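
The `spacy.util.filter_spans` call matters because the matcher can return duplicate and overlapping matches; spaCy's documentation describes the function as preferring the (first) longest span when spans overlap. The sketch below imitates that behaviour on plain `(start, end)` token offsets so it runs without spaCy. The function name `filter_overlaps` and the sample offsets are assumptions for illustration, not actual matcher output.

```python
def filter_overlaps(spans):
    # Longest spans first; among equal lengths, the one that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in ordered:
        # A span no longer than the kept ones overlaps one of them
        # exactly when one of its end tokens is already covered.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)

# Hypothetical raw matches over "1 cups , 1 1/2 cups , 1 1/2-inch":
# (3, 6) is "1 1/2 cups", (4, 6) is the shorter "1/2 cups", (0, 2) appears twice.
raw = [(0, 2), (3, 6), (4, 6), (7, 9), (0, 2)]
filtered = filter_overlaps(raw)
print(filtered)
```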
Wiktor Stribiżew