4

I would expect summarization tasks to generally assume long documents. However, following the documentation here, every simple summarization invocation I try reports that my documents are too long:

>>> summarizer = pipeline("summarization")
>>> summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5620 > 1024). Running this sequence through the model will result in indexing errors

>>> summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (8084 > 1024). Running this sequence through the model will result in indexing errors

>>> summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5971 > 512). Running this sequence through the model will result in indexing errors

What model or configuration choice makes this most automatic? I've read other questions suggesting manually chunking the data or truncating it, but the choice of boundaries and chunk length seems like it would make a difference in the summaries. What's the best practice for an arbitrarily long document? (Unbounded would be great, but let's say 50,000 tokens at a minimum.)

Mittenchops
  • You could try the Longformer Encoder-Decoder (LED) model https://huggingface.co/docs/transformers/master/model_doc/led, which handles up to 16k tokens, or Reformer, BigBird, and so on – kkgarg Dec 11 '21 at 19:50
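A minimal sketch of the comment's suggestion, assuming the allenai/led-large-16384-arxiv checkpoint (a LED model fine-tuned for summarization; not named in the comment) and a document that fits within LED's 16,384-token window:

>>> from transformers import pipeline
>>> summarizer = pipeline("summarization", model="allenai/led-large-16384-arxiv")
>>> summary = summarizer(fulltext, truncation=True)  # drop anything past the 16,384-token limit
>>> summary[0]["summary_text"]

Note that LED's 16k-token window still falls short of the 50,000 tokens mentioned in the question.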

1 Answer

10

I am assuming that a minimum of 50k tokens means you are trying to summarize something as long as a novel. Unfortunately, there is not yet a model that can process that much data in a single pass, mostly because the memory footprint of such a model would be too high for production use. Pegasus (Google), Longformer, and Reformer are all viable options for summarizing long documents, and research into models that can process longer sequences without consuming a lot of resources is ongoing; Reformer, for example, is heavily optimized to handle a large number of tokens (https://huggingface.co/blog/reformer).

By far the best practice is still a "divide and conquer" approach: chunk your data using the model's maximum input length as a reference, summarize each chunk, and repeat in iterations on the concatenated summaries until you reach the specified summary length (a rough sketch is below). You can also explore the different methods of summarization, extractive and abstractive, and use your creativity in combining those techniques, for example extractive summarization followed by abstractive.
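A minimal sketch of that divide-and-conquer loop, assuming the facebook/bart-large-cnn checkpoint from the question, naive sentence splitting on ". ", and illustrative chunk/summary-length values (not tuned numbers):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
tokenizer = summarizer.tokenizer

def chunk_text(text, max_tokens=900):
    # Pack sentences into chunks that stay under max_tokens, so each chunk
    # fits inside the model's 1024-token limit.
    chunks, current = [], []
    for sentence in text.split(". "):
        current.append(sentence)
        if len(tokenizer.encode(". ".join(current))) > max_tokens and len(current) > 1:
            current.pop()
            chunks.append(". ".join(current))
            current = [sentence]
    if current:
        chunks.append(". ".join(current))
    return chunks

def summarize_long(text, target_tokens=900):
    # Summarize chunk by chunk, then repeat on the joined summaries
    # until the whole thing fits within the target length.
    while len(tokenizer.encode(text)) > target_tokens:
        summaries = summarizer(chunk_text(text), max_length=150, min_length=30,
                               truncation=True)
        text = " ".join(s["summary_text"] for s in summaries)
    return text

print(summarize_long(fulltext))

Where the chunk boundaries fall does affect summary quality, so splitting on paragraph or section breaks instead of a plain ". " split is usually worth the extra effort.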

codeslord