
Based on this link: Is it possible to use spacy with already tokenized input?

I can get spaCy to take a pre-tokenized doc as input and process it further. The code is below:

from spacy.tokens import Doc

def nlp_process(self, token_tuple):
    # token_tuple = ("This is a test", ['This', 'is', 'a', 'test'])
    doc = Doc(self.nlp.vocab, words=token_tuple[1])
    for name, proc in self.nlp.pipeline:
        doc = proc(doc)
    return doc

This works well for a single input. But what if I want to process docs in batch mode using the nlp.pipe() function? Something like:

   nlp_docs = self.nlp.pipe(texts)

nlp.pipe() takes a list of raw texts. How do I handle this situation?

marlon

1 Answer


There's no way to set up a tokenizer-less pipeline in spaCy. One option is to call each pipeline component individually after creating the docs:

for pipe_name in nlp.pipe_names:
    docs = nlp.get_pipe(pipe_name).pipe(docs)

Or, equivalently:

for pipe_name, pipe in nlp.pipeline:
    docs = pipe.pipe(docs)

If you're not using multiprocessing, this will be as efficient as nlp.pipe(), since it is more or less what nlp.pipe() does under the hood.
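As a runnable sketch of this first option, using the spaCy v3 API: the sentencizer below is a stand-in component so that no model download is needed; with a loaded model such as en_core_web_sm the loop is identical (on spaCy v2 you would add the component with nlp.add_pipe(nlp.create_pipe("sentencizer")) instead).

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # stand-in component; a loaded model works the same way

pretokenized = [
    ["This", "is", "a", "test", "."],
    ["Another", "pretokenized", "input", "."],
]

# Build the Docs directly from the token lists, bypassing the tokenizer...
docs = [Doc(nlp.vocab, words=words) for words in pretokenized]

# ...then stream them through each component's own pipe() method.
for name, proc in nlp.pipeline:
    docs = proc.pipe(docs)

docs = list(docs)  # the component pipes are generators
```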

Another alternative is to create your own replacement tokenizer that accepts either List[str] or Doc input and assign it to nlp.tokenizer. Then you can call nlp.pipe() as usual. The simplest version, with List[str] input, would look like this:

from spacy.tokens import Doc

class CustomTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, words):
        # `words` is a list of token strings, not a raw text string
        return Doc(self.vocab, words=words)

nlp.tokenizer = CustomTokenizer(nlp.vocab)
doc = nlp(["This", "is", "a", "sentence", "."])

This example and some related examples and discussion are here: https://github.com/explosion/spaCy/issues/5399
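Putting this together for batch input: on the spaCy v2 releases this answer was written against, nlp.pipe() accepts the List[List[str]] directly once the tokenizer is replaced. spaCy v3.1+ type-checks the inputs to nlp() and nlp.pipe(), but nlp.make_doc() still delegates to whatever nlp.tokenizer is, and nlp.pipe() accepts ready-made Doc objects, so a version of the same idea that runs on current spaCy looks like this (the class name PretokenizedTokenizer and the sentencizer stand-in are mine):

```python
import spacy
from spacy.tokens import Doc

class PretokenizedTokenizer:
    """Replacement tokenizer: expects a list of token strings per input."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, words):
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")                 # no model download needed
nlp.add_pipe("sentencizer")             # stand-in component (spaCy v3 API)
nlp.tokenizer = PretokenizedTokenizer(nlp.vocab)

inputs = [["This", "is", "test"], ["This", "was", "test"]]

# make_doc() delegates to the custom tokenizer; pipe() then processes the Docs.
# On spaCy v2, nlp.pipe(inputs) would accept the token lists directly.
docs = list(nlp.pipe(nlp.make_doc(words) for words in inputs))
```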

aab
  • The first option also ran into an issue: the neural network's predict reports a 'no doc attribute' error on 'doc.doc'. – marlon Jun 29 '20 at 22:13
  • Regarding 'Then you can call nlp.pipe() as usual': if I use the 'CustomTokenizer' option, when I use nlp.pipe(), would the input of pipe be a list of lists of tokens? inputs = [['This', 'is', 'test'], ['This', 'was', 'test']]? This won't work, since the CustomTokenizer's input is a list of words, not a list of lists of words. – marlon Jun 29 '20 at 22:17
  • The `docs` argument for `pipe.pipe()` is `List[Doc]`. If you use `nlp.pipe()` with the custom tokenizer then you provide `List[List[str]]`, if you use `nlp()`, then `List[str]` as in the example above. – aab Jun 30 '20 at 09:18