
I'm searching for a way to make sure that any time the sequence "#*" appears in the text, spaCy gives me the token "#*". I tried every possible way of adding special cases with add_special_case and building a custom Tokenizer using prefix_search, suffix_search, infix_finditer and token_match, but there are still cases where, if a "#*" appears in a sentence, even when it's surrounded by perfectly ordinary tokens (tokens that should be recognized without a problem), the "#*" is split into [#, *]. What can I do?

Thanks.

John Smith Optional

1 Answer


spaCy's current handling of special cases that contain characters that are otherwise prefixes or suffixes isn't ideal and isn't quite what you'd expect in all cases.

This would be a bit easier to answer with examples of what the text looks like and where the tokenization isn't working, but:

If #* is always surrounded by whitespace, a special case should work:

import spacy  # setup added so the snippet runs on its own; a blank English pipeline is enough for the tokenizer
nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("#*", [{"ORTH": "#*"}])
print([t.text for t in nlp("a #* a")])  # ['a', '#*', 'a']

If #* should be tokenized as if it were a word like to, one option is to remove # and * from the prefixes and suffixes; then those characters aren't treated any differently from t or o. Adjacent punctuation would be split off as affixes, adjacent letters/numbers wouldn't be.
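
A minimal sketch of that option (not from the original answer): the exact escaped form of each entry can differ between spaCy versions, so inspect nlp.Defaults.prefixes and nlp.Defaults.suffixes to see how # and * are listed in yours.

drop = {"#", r"\#", "*", r"\*"}  # assumed spellings of the default entries; adjust to match your version
prefixes = [p for p in nlp.Defaults.prefixes if p not in drop]
suffixes = [s for s in nlp.Defaults.suffixes if s not in drop]
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search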

If #* is potentially adjacent to any other characters, as in #*a or a#*a or "#*", it's probably easiest to add it as a prefix, suffix, and infix, placing it before the default patterns so that defaults like the single-character # aren't matched first:

# Put "#*" ahead of the default patterns so it takes precedence over single-character patterns like "#"
prefixes = (r"#\*",) + nlp.Defaults.prefixes
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
suffixes = (r"#\*",) + nlp.Defaults.suffixes
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
infixes = (r"#\*",) + nlp.Defaults.infixes
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

print([t.text for t in nlp("a#* a#*a #*a '#*'")])
# ['a', '#*', 'a', '#*', 'a', '#*', 'a', "'", '#*', "'"]

This is a good case for using the new debugging function that was just added to the tokenizer (disclaimer: I am the author). With spaCy v2.2.3, try:

nlp.tokenizer.explain('#*')

The output [('PREFIX', '#'), ('SUFFIX', '*')] tells you which patterns are responsible for the resulting tokenization. As you modify the patterns, this function should let you see more easily whether your modifications are working as intended.

After the modifications in the final example above, the output is:

nlp.tokenizer.explain("a#* a#*a #*a '#*'")
# [('TOKEN', 'a'), ('SUFFIX', '#*'), ('TOKEN', 'a'), ('INFIX', '#*'), ('TOKEN', 'a'), ('PREFIX', '#*'), ('TOKEN', 'a'), ('PREFIX', "'"), ('PREFIX', '#*'), ('SUFFIX', "'")]
aab
  • Thanks for the detailed reply. I'll look into all of it later. Since I was in a hurry, I just transformed the original text to systematically add a space before and after these sequences so that add_special_case would finally work. By the way, in the documentation of functions such as `add_special_case` or `compile_suffix_regex`, I found it impossible to determine which parameters should be regexes or regex fragments and which should be treated as plain strings. Maybe the docs could be improved on that point. Thanks again for the reply. – John Smith Optional Nov 25 '19 at 11:57
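
A rough sketch of the preprocessing workaround mentioned in this comment (illustrative only, not from the thread):

import re
# Pad every "#*" with spaces so the special case added with add_special_case always applies.
padded = re.sub(r"#\*", " #* ", "a#*a and '#*'")
print([t.text for t in nlp(padded)])  # '#*' now surfaces as its own token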