
I want to include hyphenated words, for example long-term, self-esteem, etc., as single tokens in spaCy. After looking at some similar posts on Stack Overflow and GitHub, the spaCy documentation, and elsewhere, I also wrote a custom tokenizer as below:

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

So for this sentence: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.'

Now, the tokens after incorporating the custom spaCy tokenizer are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

Earlier, the tokens before this change were:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

And, the expected tokens should be:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'

Summary: As one can see...

  • the hyphenated word is now kept as a single token, and so are the other punctuation marks, except for the double quotes and the apostrophe...
  • ...but now the apostrophe and the double quotes no longer behave the way they did earlier (or the way they are expected to).
  • I have tried various permutations and combinations of the infix regex, but I have made no progress on fixing this issue.
  • To be clear, *“medicine”* was always tokenizing (wrongly, both before-and-after) with the trailing double-quote separate: *'“medicine', '”'*. And you also want to fix that. – smci Jun 16 '20 at 20:53

1 Answer


Using the default prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # keep the custom infix pattern from the question...
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    # ...but build the prefix and suffix regexes from spaCy's defaults, so
    # characters such as the curly quotes are still stripped at the edges
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]
['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']
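
The reason the defaults help: the prefix and suffix character classes in the question only cover straight quotes, brackets and parentheses, so the curly '“' and '”' are never stripped from the edges of '“medicine”'. A quick check with the regexes taken from the question illustrates this:

import re

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')

# Neither pattern knows about the curly quotation marks, so nothing is
# stripped and the quote stays glued to the word:
print(prefix_re.search('“medicine'))   # None
print(suffix_re.search('medicine”'))   # None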

If you want to dig into why your regexes weren't working like spaCy's, here are links to the relevant source code:

Prefixes and suffixes defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

With reference to the characters (e.g., quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g., compile_prefix_regex):

https://github.com/explosion/spaCy/blob/master/spacy/util.py
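
For a quick look at what those defaults contain: they are just tuples of pattern strings, and the compile_* helpers join them into a single compiled regex (a minimal sketch, assuming a spaCy v2-style API where nlp.Defaults.prefixes is such a tuple):

import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex

nlp = spacy.blank('en')

# The defaults are plain tuples of regex pattern strings...
print(len(nlp.Defaults.prefixes), 'prefix patterns')
print(len(nlp.Defaults.suffixes), 'suffix patterns')

# ...and the compile_* helpers join them into one compiled pattern that is
# applied to the start (prefixes) or end (suffixes) of each token candidate.
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

print(prefix_re.search('“medicine').group())   # the curly opening quote is covered
print(suffix_re.search('medicine”').group())   # and so is the closing one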

  • I can't thank you enough Nicholas! :) It works as expected now. The issue was with the default prefix_re and suffix_re, as was rightly pointed out. Thanks also for sharing the links to the punctuation and quotation characters (e.g., quotes, hyphens, etc.) as well as the functions used to compile them! They were really handy and will help cover the corner cases, especially across other languages! – Vishal Jul 05 '18 at 07:28
  • Your recommended regex splits "This can't be it." as follows: ['This', 'can', "'", 't', 'be', 'it', '.'], which is not what one (or at least I) would expect. – Zeeshan Ali Apr 29 '19 at 12:00
  • Your recommended regex solves all the issues raised here, but it creates further issues such as the one I mentioned above. – Zeeshan Ali Apr 29 '19 at 12:01
  • I, personally, have tried many ways to make sure that "intra-hyphen" words are not split apart, but I always end up creating other problems with sentence or token splitting. – Zeeshan Ali Apr 29 '19 at 12:07
  • E.g.: infixes = tuple([r"(n[o']t|'\w{1,2})\b", r"(?<!\d)\.(?!\d)"]) + nlp.Defaults.prefixes; infix_re = spacy.util.compile_infix_regex(infixes); nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer) – Zeeshan Ali Apr 29 '19 at 12:08
  • How do you make it work for the tilde? `~2` is still not split even though `~` is in the list. – Dima Lituiev Mar 10 '21 at 19:50
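
Following up on these comments: a less invasive way to keep hyphenated words together, without disturbing how apostrophes and quotes are handled, is to reuse spaCy's default infix patterns and drop only the rule that splits on a hyphen between letters. A sketch along the lines of the example in spaCy's tokenizer documentation (it assumes the v2-style char_classes module; the exact default pattern list may differ between versions):

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_lg')

# Rebuild the default infix patterns, leaving out the rule that splits on a
# hyphen between two letters, so 'male-dominated' stays a single token.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # omitted default rule: r"(?<=[{a}])(?:{h})(?=[{a}])" with h=HYPHENS
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("Note: it's a male-dominated profession, and this can't be it.")
print([t.text for t in doc])
# 'male-dominated' should stay together, while the tokenizer exceptions still
# split "it's" into 'it', "'s" and "can't" into 'ca', "n't"

Because only the infix rules change, the prefixes, suffixes and tokenizer exceptions of the loaded model are left exactly as they were.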