I am using the XSum dataset for abstractive summarization. Some summaries share common n-grams, and I need to find all the articles whose summaries contain these shared n-grams.
For example, if I have the following articles and their corresponding summaries:
Article     Summary
article1    x a a b d m
article2    x a b d c e m
article3    y z c f a b d c e q u
article4    m g a a b d v r a
article5    r a e q u d x
If I want all groups of documents that share an n-gram of length 4 or more, the output should be:
Articles               Common n-gram
article1, article4     a a b d
article2, article3     a b d c e
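To make the n-gram definition above concrete, here is a minimal sketch of extracting all contiguous 4-grams from a single summary (assuming simple whitespace tokenization; the function name `ngrams` is just illustrative):

```python
def ngrams(tokens, n):
    # All contiguous windows of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("x a a b d m".split(), 4))
# [('x', 'a', 'a', 'b'), ('a', 'a', 'b', 'd'), ('a', 'b', 'd', 'm')]
```

The second of these 4-grams, ('a', 'a', 'b', 'd'), is the one article1 shares with article4 in the example.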
My dataset contains 200k articles and their corresponding summaries.
What I have tried:
I tried using Lucene to:
- index the documents
- index the n-grams of the summaries
But I don't know Java, and it is difficult to figure out how to retrieve the documents that share common n-grams.
Help
Can someone please guide me on how this can be done in Python? Or, if Lucene is the right tool, could someone point me in the right direction? I have gone through the Lucene tutorials, but I didn't find anything addressing my specific need and was only left more confused.
I got this idea from a YouTube video: instead of the analyzer breaking the text into individual tokens, it could break it into n-grams. Then my inverted index would map each n-gram to its frequency and the documents it appears in.
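A pure-Python sketch of that inverted-index idea, under the assumptions that summaries are whitespace-tokenized and that only maximal shared n-grams should be reported (as in the example output, where `a b d c` is subsumed by `a b d c e`). The function names `ngrams` and `shared_ngrams` are my own; since summaries are short, enumerating every n-gram of length `min_n` up to the full summary length stays cheap per document even at 200k documents, though the final maximality pass below is quadratic in the number of shared n-grams and would need a smarter grouping at scale:

```python
from collections import defaultdict

def ngrams(tokens, n):
    # All contiguous windows of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shared_ngrams(summaries, min_n=4):
    """summaries: dict mapping doc id -> token list.
    Returns maximal n-grams (length >= min_n) shared by >= 2 documents."""
    index = defaultdict(set)  # n-gram tuple -> set of doc ids
    for doc_id, tokens in summaries.items():
        for n in range(min_n, len(tokens) + 1):
            for gram in ngrams(tokens, n):
                index[gram].add(doc_id)
    # Keep only n-grams that occur in at least two documents.
    shared = {g: d for g, d in index.items() if len(d) >= 2}
    # Drop any shared n-gram that is contained in a longer shared
    # n-gram with the same document set (keep only maximal matches).
    maximal = {}
    for gram, docs in shared.items():
        contained = any(
            len(other) > len(gram)
            and docs == other_docs
            and any(other[i:i + len(gram)] == gram
                    for i in range(len(other) - len(gram) + 1))
            for other, other_docs in shared.items()
        )
        if not contained:
            maximal[gram] = docs
    return maximal

summaries = {
    "article1": "x a a b d m".split(),
    "article2": "x a b d c e m".split(),
    "article3": "y z c f a b d c e q u".split(),
    "article4": "m g a a b d v r a".split(),
    "article5": "r a e q u d x".split(),
}
for gram, docs in shared_ngrams(summaries).items():
    print(", ".join(sorted(docs)), ":", " ".join(gram))
```

On the toy data above this reproduces the expected grouping: article1 and article4 share `a a b d`, article2 and article3 share `a b d c e`, and article5 matches nothing. This is only a sketch of the grouping logic, not a substitute for a real index like Lucene's.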
Thank you.