
I'm working on a text corpus in which many individual tokens contain punctuation such as `:` `-` `)` `(` `@`, for example `TMI-Cu(OH)`. I therefore want to customize the tokenizer so that it does not split on `:` `-` `)` `(` `@` when they are tightly enclosed (no whitespace) by digits/letters.

From this post, I learned that I can modify `infix_finditer` to achieve this. However, the solution still splits on `)` when the `)` is not followed by a word/digit character, as demonstrated in this example:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)

test_str0 = 'This is TMI-Cu(OH), and somethig else'
doc0 = nlp(test_str0)
[token.text for token in doc0]

The output is ['This', 'is', 'TMI-Cu(OH', ')', ',', 'and', 'somethig', 'else'], where the individual token TMI-Cu(OH) is split into two tokens ['TMI-Cu(OH', ')'].

Is it possible to implement 'lookbehind' behavior in the tokenizer? That is, for a closing parenthesis `)` that is followed by a non-word/non-digit character, before splitting it off as a new token, first look behind to check whether there is any whitespace between the `)` and its paired `(`. If there is no whitespace, don't split.

meTchaikovsky
  • Do you want to avoid splitting on `.`, `,`, `?`, `:`, `;`, `…`, `‘`, `’`, backtick, `“`, `”`, `"`, `'` and `~`? Anywhere? Or between letters/digits only? Does it mean `TMI-Cu(OH)2,` should be the single token in the output, and not `['TMI-Cu(OH)2', ',']`? – Wiktor Stribiżew Apr 29 '22 at 07:53
  • The tokenizer works with each whitespace-separated string separately, so you can only look behind within that context. – aab Apr 29 '22 at 07:55
  • @WiktorStribiżew Hi! I want to avoid splitting when punctuations comes between letters/digits, so `TMI-Cu(OH)2,` should be `['TMI-Cu(OH)2', ',']`. The problem is if there is no letters/digits right behind `)` (`TMI-Cu(OH),`), the result becomes `'[TMI-Cu(OH', ')', ',']`, while the expected one is `'[TMI-Cu(OH)', ',']`. – meTchaikovsky Apr 29 '22 at 08:13
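The rule the comments converge on can be sketched in plain Python, independent of spaCy (a hypothetical helper operating on one whitespace-free chunk, as aab notes the tokenizer does):

```python
def split_trailing(chunk):
    """Peel trailing punctuation off a whitespace-free chunk, but keep a
    trailing ')' attached when it closes a '(' inside the same chunk."""
    tail = []
    while chunk and not chunk[-1].isalnum():
        if chunk[-1] == ')' and chunk.count('(') >= chunk.count(')'):
            break  # the ')' closes a '(' inside the chunk: keep it attached
        tail.insert(0, chunk[-1])
        chunk = chunk[:-1]
    return [chunk] + tail

print(split_trailing('TMI-Cu(OH),'))  # ['TMI-Cu(OH)', ',']
print(split_trailing('CuOH),'))      # ['CuOH', ')', ',']
```

This is only an illustration of the desired splitting rule, not how spaCy's `Tokenizer` is implemented internally.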

1 Answer


You need to remove the ) from the suffixes:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''(?:[^\w\s]|_)(?<![-:@()])''') # Matching all special chars with your exceptions
    suffixes = nlp.Defaults.suffixes
    suffixes.remove(r'\)')   # Removing the `\)` pattern from suffixes
    suffix_re = compile_suffix_regex(suffixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)

test_str0 = 'This is TMI-Cu(OH), and somethig else'
doc0 = nlp(test_str0)
print([token.text for token in doc0])

Output:

['This', 'is', 'TMI-Cu(OH)', ',', 'and', 'somethig', 'else']

Note that the `(?:[^\w\s]|_)(?<![-:@()])` regex used for infix matching matches any special character other than whitespace and the `-`, `:`, `@`, `(` and `)` characters.
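As a quick sanity check of that infix pattern outside spaCy (plain `re`, hypothetical sample strings):

```python
import re

# The answer's infix pattern: match any punctuation char or underscore,
# then reject the match if the matched char is one of - : @ ( )
infix_re = re.compile(r'(?:[^\w\s]|_)(?<![-:@()])')

print(infix_re.findall('TMI-Cu(OH),'))   # [','] - only the comma is an infix
print(infix_re.findall('a-b:c@d(e)f'))   # []   - all chars are exempted
```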

Wiktor Stribiżew
  • Thank you! :) This solution works fine except for one problem: if the sentence is like `This is AAA (it is TMI-CuOH), 50 mn, and somethig else`, the `)` is not a paired parenthesis enclosing `OH` within the same chunk. Is it possible to make a further adjustment to detect such parentheses? – meTchaikovsky Apr 30 '22 at 02:52
  • @meTchaikovsky I suspect it is possible if a retokenization function is implemented; you would need to perform a retrospective check across tokens. I will check once I am back home. – Wiktor Stribiżew Apr 30 '22 at 11:25
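One possible direction for that follow-up (a hedged sketch, not the retokenization the answerer promised): post-process the `Doc` with spaCy's retokenizer, merging a `)` token back into its left neighbour only when that neighbour contains an unmatched `(`. The merge decision itself needs no spaCy:

```python
def closes_open_paren(prev_text: str) -> bool:
    # True when prev_text contains a '(' that is still unmatched, so a
    # directly adjacent following ')' token should be merged back into it.
    return prev_text.count('(') > prev_text.count(')')

# Sketch of the spaCy side (assumes a parsed `doc`); the tok.idx check
# ensures the ')' is adjacent to the previous token with no whitespace:
#   with doc.retokenize() as retok:
#       for i, tok in enumerate(doc[1:], start=1):
#           prev = doc[i - 1]
#           if (tok.text == ')' and tok.idx == prev.idx + len(prev)
#                   and closes_open_paren(prev.text)):
#               retok.merge(doc[i - 1 : i + 1])

print(closes_open_paren('TMI-Cu(OH'))  # True  -> merge into 'TMI-Cu(OH)'
print(closes_open_paren('TMI-CuOH'))   # False -> leave the ')' split off
```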