I would expect summarization tasks to generally assume long documents. However, following documentation here, any of the simple summarization invocations I make say my documents are too long:
>>> summarizer = pipeline("summarization")
>>> summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5620 > 1024). Running this sequence through the model will result in indexing errors
>>> summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (8084 > 1024). Running this sequence through the model will result in indexing errors
>>> summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5971 > 512). Running this sequence through the model will result in indexing errors
What model or configuration choice makes this most automatic? I've read other questions suggesting manually chunking the data or truncation, but the choice of boundaries and chunk length seem like they would make a difference in summaries. What's the best practice for an arbitrary long document? (Unbounded would be great, but let's say 50,000 tokens at a minimum.)