I am using the XSum dataset for abstractive summarization. Some summaries share common n-grams, and I need to find all the articles whose summaries contain these shared n-grams.
For example, if I have the following articles and their corresponding summaries:
Article     Summary
article1    x a a b d m
article2    x a b d c e m
article3    y z c f a b d c e q u
article4    m g a a b d v r a
article5    r a e q u d x
If I want all groups of documents that share an n-gram of length 4 or more, the output should be:
Articles               Common n-gram
article1, article4     a a b d
article2, article3     a b d c e
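To make the n-gram definition above concrete, here is a minimal sketch of extracting all contiguous 4-grams from a single summary (assuming simple whitespace tokenization; the function name `ngrams` is just illustrative):

```python
def ngrams(tokens, n):
    # All contiguous windows of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("x a a b d m".split(), 4))
# [('x', 'a', 'a', 'b'), ('a', 'a', 'b', 'd'), ('a', 'b', 'd', 'm')]
```

The second of these 4-grams, ('a', 'a', 'b', 'd'), is the one article1 shares with article4 in the example.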
My dataset contains 200k articles and their corresponding summaries.
What I have tried:
I tried using Lucene to:
- index the documents
- index the n-grams of the summaries
But I don't know Java, and it is difficult to figure out how to retrieve the documents that share common n-grams.
Help
Can someone please guide me on how this can be done in Python? Or, if Lucene is the right tool, could someone point me in the right direction? I have gone through the Lucene tutorials, but I didn't find anything addressing my specific need and was only left more confused.
I got this idea from a YouTube video: instead of the analyzer breaking the text into individual tokens, it could break it into n-grams. Then my inverted index would map each n-gram to its frequency and the documents it appears in.
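A pure-Python sketch of that inverted-index idea, under the assumptions that summaries are whitespace-tokenized and that only maximal shared n-grams should be reported (as in the example output, where `a b d c` is subsumed by `a b d c e`). The function names `ngrams` and `shared_ngrams` are my own; since summaries are short, enumerating every n-gram of length `min_n` up to the full summary length stays cheap per document even at 200k documents, though the final maximality pass below is quadratic in the number of shared n-grams and would need a smarter grouping at scale:

```python
from collections import defaultdict

def ngrams(tokens, n):
    # All contiguous windows of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shared_ngrams(summaries, min_n=4):
    """summaries: dict mapping doc id -> token list.
    Returns maximal n-grams (length >= min_n) shared by >= 2 documents."""
    index = defaultdict(set)  # n-gram tuple -> set of doc ids
    for doc_id, tokens in summaries.items():
        for n in range(min_n, len(tokens) + 1):
            for gram in ngrams(tokens, n):
                index[gram].add(doc_id)
    # Keep only n-grams that occur in at least two documents.
    shared = {g: d for g, d in index.items() if len(d) >= 2}
    # Drop any shared n-gram that is contained in a longer shared
    # n-gram with the same document set (keep only maximal matches).
    maximal = {}
    for gram, docs in shared.items():
        contained = any(
            len(other) > len(gram)
            and docs == other_docs
            and any(other[i:i + len(gram)] == gram
                    for i in range(len(other) - len(gram) + 1))
            for other, other_docs in shared.items()
        )
        if not contained:
            maximal[gram] = docs
    return maximal

summaries = {
    "article1": "x a a b d m".split(),
    "article2": "x a b d c e m".split(),
    "article3": "y z c f a b d c e q u".split(),
    "article4": "m g a a b d v r a".split(),
    "article5": "r a e q u d x".split(),
}
for gram, docs in shared_ngrams(summaries).items():
    print(", ".join(sorted(docs)), ":", " ".join(gram))
```

On the toy data above this reproduces the expected grouping: article1 and article4 share `a a b d`, article2 and article3 share `a b d c e`, and article5 matches nothing. This is only a sketch of the grouping logic, not a substitute for a real index like Lucene's.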
Thank you.