I'm working on a text corpus in which many individual tokens contain punctuation such as : - ) ( @, for example TMI-Cu(OH). Therefore, I want to customize the tokenizer so that it does not split on : - ) ( @ when they are tightly enclosed (no whitespace) by letters/digits.
From this post, I learned that I can modify the infix_finditer to achieve this. However, that solution still splits on ) when the ) is not followed by a letter/digit, as demonstrated in this example:
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
test_str0 = 'This is TMI-Cu(OH), and somethig else'
doc0 = nlp(test_str0)
[token.text for token in doc0]
The output is ['This', 'is', 'TMI-Cu(OH', ')', ',', 'and', 'somethig', 'else'], where the single token TMI-Cu(OH) is split into the two tokens ['TMI-Cu(OH', ')'].
Is it possible to implement a 'lookbehind' behavior in the tokenizer? That is, for a ')' that is followed by a non-word/non-digit character, before splitting on it to generate a new token, first look behind to check whether there is any whitespace between the ')' and its paired '('. If there is no whitespace, don't split.
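For what it's worth, one workaround I've sketched (not a true lookbehind, and the PAIRED_PARENS pattern is my own guess at what my tokens look like) is to pass a token_match regex to the Tokenizer: when token_match accepts a whole whitespace-delimited substring, spaCy keeps it as a single token instead of stripping suffixes like ) off it.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

# Hypothetical pattern for my tokens: letters/digits joined by - : @,
# plus one or more paired "(...)" groups, with no whitespace inside.
PAIRED_PARENS = re.compile(r'^\w+(?:[-:@]\w+)*(?:\(\w+\))+\w*$')

def custom_tokenizer(nlp):
    # Same custom infixes as above (parentheses are not infixes here).
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     # Keep the substring whole whenever the pattern matches.
                     token_match=PAIRED_PARENS.match)

nlp = spacy.blank('en')  # blank pipeline, so no model download is needed
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp('This is TMI-Cu(OH), and somethig else')
print([token.text for token in doc])
```

The trailing comma is still split off as a suffix (token_match fails on 'TMI-Cu(OH),'), but once the comma is gone the pattern matches 'TMI-Cu(OH)' and the ')' is no longer stripped. The obvious downside is that this whitelists a fixed token shape rather than genuinely checking for a paired '(' at split time.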