
I have a non-linguistic corpus of ~100 "documents", each comprising a sequence of ~10k "words" (i.e. I have a set of ~100 integer sequences). I can learn good doc2vec embeddings that respect known classes in the corpus. I'm now interested in summarizing these documents to help explain which motifs are not only representative of each document but also discriminative between classes.
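For reference, a minimal sketch of this kind of setup (not my exact pipeline; the toy corpus, tags, and hyperparameters below are placeholders):

```python
# Minimal sketch: training Doc2Vec on integer-token "documents".
# Token integers are stringified because gensim expects string tokens;
# the corpus and hyperparameters below are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [[7, 3, 3, 12, 7, 91], [3, 12, 12, 7, 0]]  # stand-in for ~100 sequences of ~10k ints

tagged = [TaggedDocument(words=[str(t) for t in seq], tags=[f"doc_{i}"])
          for i, seq in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=32, window=5, min_count=1,
                dm=0, dbow_words=1, epochs=50)

vec = model.dv["doc_0"]  # per-document embedding (gensim 4.x API)
```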

I am primarily familiar with TextRank as an extractive summarization method, but it typically relies on sentences (i.e. subsequences ending with a period) as the atoms for the underlying node-ranking algorithm. In my case, there are no sentences per se, so suitable atoms are not known in advance.

Are there any summarization methods that take this into account? So far, I have tried using TextRank on all n-grams for a fixed n, but this precludes summaries involving units of different lengths, which is crucial in my setting. Are there any multi-scale summarization methods, for instance?
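For concreteness, a rough sketch of the fixed-n variant I tried (the overlap-based similarity and the toy document are illustrative choices only):

```python
# Rough sketch: treat every fixed-length n-gram as a "sentence", weight edges
# by token overlap, and rank nodes with PageRank (TextRank-style).
import itertools
import networkx as nx

def textrank_ngrams(doc, n=3, topk=5):
    ngrams = list({tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)})
    g = nx.Graph()
    g.add_nodes_from(ngrams)
    for a, b in itertools.combinations(ngrams, 2):
        overlap = len(set(a) & set(b))  # crude similarity between n-grams
        if overlap:
            g.add_edge(a, b, weight=overlap)
    scores = nx.pagerank(g, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:topk]

print(textrank_ngrams([7, 3, 3, 12, 7, 91, 3, 12], n=3))
```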

  • How large is the 'vocabulary' of unique terms? How significant do you expect ordering/subsets to be? For example, would the same 10k tokens of an existing doc, reshuffled, be almost certain to deserve the same categorization/summary as the original, or almost certain to mean something completely different (or be nonsense)? – gojomo Aug 05 '21 at 13:57
  • The vocabulary is very small (~100 terms) compared to the linguistic setting, but my goal was merely to show that contextual embeddings like doc2vec outperform the sequence representations typically used in my subject area (which they handily do). I have taken sensible steps to adapt doc2vec to the small-vocab domain. Order is crucial: bags of words are similar between classes and not very discriminative compared to representations respecting order (~30% drop in classification accuracy in my 5-class logistic regressor). – MRicci Aug 05 '21 at 14:12
  • I see. It strikes me that clustering on any sort of vectorized representation (be it doc2vec, bag-of-words, etc.) is a sort of 'summarization', in that the clusters an item is 'in', or the cluster regions/centroids that an item is 'near', are sort-of repeating 'motifs', which might align well with known labels or cross-cut them. – gojomo Aug 05 '21 at 19:09
  • With what you say about vocab-size, & relevance of order, I suspect more feature-engineering to detect significant ordered-clumps in the 10k-token-runs might be warranted, as best hints to larger patterns – more so than simple presence/absence, or even freq, of the 100 terms. You might try applying the Gensim `Phrases` class (or similar techniques), for statistically promoting *some* runs-of-tokens to be new compound tokens (either in lieu of, or alongside, the originals). Those added features might then be vividly discriminative in downstream classification/clustering/similarity-ranking evals. (A minimal sketch follows the comment thread below.) – gojomo Aug 05 '21 at 19:15
  • I had seen `Phrases` before but had not used it; I will give it a shot. Also, I am open to using any sort of clustering as long as the result is interpretable at the level of words/word sequences. This would seem not to be the case for some representations, like doc2vec, where the dimensions are unwordlike. – MRicci Aug 05 '21 at 21:07
  • While the individual axis-aligned dimensions of `Doc2Vec` may not be neatly interpretable/labelable, other *directions* (across many dimensions) or *regions* might be somewhat labelable, with some effort. Many `Doc2Vec` modes (`dm=1`, or `dm=0, dbow_words=1`) co-train individual word vectors, and (at least when they're real words) the closeness of these word-vectors to the doc-vectors is often somewhat descriptive. (A sketch of this follows the thread.) – gojomo Aug 06 '21 at 17:03
  • Also, with either the singleton tokens, or any longer-runs of tokens you manage to promote to distinct features, you could (after initial training) create synthetic new examples, with certain features dropped or added, then provide those to things like `Doc2Vec` inference, or other classification/clustering steps, to see which are most-dispositive between either the labels you already know, or any new clusters you create. (A rough sketch of this probe follows the thread.) – gojomo Aug 06 '21 at 17:05
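A minimal sketch of the `Phrases` suggestion from the comments: statistically promote frequent runs of tokens to compound tokens, then re-apply to pick up longer runs, yielding motifs of mixed lengths. The thresholds below are placeholders and would need tuning for a ~100-term vocabulary:

```python
# Sketch: promote statistically significant token runs to compound tokens.
from gensim.models.phrases import Phrases

docs = [["7", "3", "3", "12"], ["3", "12", "7", "0"]]  # stringified integer tokens (placeholder)

bigrams = Phrases(docs, min_count=1, threshold=0.1)
docs_bi = [bigrams[d] for d in docs]       # e.g. "3_12" alongside/instead of "3", "12"

# Re-applying Phrases to its own output promotes longer runs (trigrams, ...),
# giving compound tokens of mixed lengths rather than a single fixed n.
trigrams = Phrases(docs_bi, min_count=1, threshold=0.1)
docs_multi = [trigrams[d] for d in docs_bi]
```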
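A sketch of the nearest-word-vectors idea: with a Doc2Vec mode that co-trains word vectors (`dm=0, dbow_words=1` here), the tokens whose vectors lie closest to a document's vector can serve as a rough description of that document. Corpus and hyperparameters are placeholders:

```python
# Sketch: describe a document by the tokens whose word-vectors are nearest its doc-vector.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [["7", "3", "3", "12", "7"], ["3", "12", "12", "0"]]  # placeholder
tagged = [TaggedDocument(words=seq, tags=[f"doc_{i}"]) for i, seq in enumerate(corpus)]
model = Doc2Vec(tagged, vector_size=16, min_count=1, dm=0, dbow_words=1, epochs=50)

print(model.wv.similar_by_vector(model.dv["doc_0"], topn=5))  # (token, cosine similarity) pairs
```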
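And a sketch of the synthetic-example probe: drop occurrences of a candidate motif, re-infer a doc-vector, and see how the document shifts relative to per-class centroids. Here `centroids` is an assumed mapping from class label to a mean doc-vector, and `drop_motif`/`probe_motif` are hypothetical helpers, not part of any library:

```python
# Sketch: measure how removing a motif changes a document's similarity to class centroids.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drop_motif(tokens, motif):
    """Remove contiguous occurrences of `motif` (a token list) from `tokens`."""
    out, i, m = [], 0, len(motif)
    while i < len(tokens):
        if tokens[i:i + m] == motif:
            i += m
        else:
            out.append(tokens[i])
            i += 1
    return out

def probe_motif(model, tokens, motif, centroids):
    """How much does removing `motif` change similarity to each class centroid?"""
    v_full = model.infer_vector(tokens)
    v_drop = model.infer_vector(drop_motif(tokens, motif))
    return {label: cosine(v_full, c) - cosine(v_drop, c)
            for label, c in centroids.items()}
```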

0 Answers