I am wondering whether there are automatic summarization algorithms that handle extraction based on a custom dictionary. I've been using TextRank-based algorithms for a while now, but I want to influence the ranking of phrases that the algorithm calculates.

Example

"Thomas A. Anderson is a man living two lives. By day he is an average computer programmer and by night a hacker known as Neo. Neo has always questioned his reality, but the truth is far beyond his imagination. Neo finds himself targeted by the police when he is contacted by Morpheus, a legendary computer hacker branded a terrorist by the government. Morpheus awakens Neo to the real world, a ravaged wasteland where most of humanity have been captured by a race of machines that live off of the humans' body heat and electrochemical energy and who imprison their minds within an artificial reality known as the Matrix. As a rebel against the machines, Neo must return to the Matrix. He must confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion."

My custom dictionary would look something like this:

super-powerful: [important]
Thomas A. Anderson: [important]

My summary should contain the following sentences, even if their ranking is lower than some other sentences in the paragraph:

  1. "Thomas A. Anderson is a man living two lives"
  2. "He must confront the agents: super-powerful computer programs devoted to snuffing out Neo and the entire human rebellion."
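To make the requirement concrete, here is a minimal sketch of the selection behaviour I am after. It assumes the sentences are already split and that some ranking step has produced one score per sentence; `build_summary`, `size`, and the plain substring match are hypothetical names and choices used only for illustration, not part of any library.

```python
def build_summary(sentences, scores, important_terms, size=3):
    """Pick sentences for the summary: dictionary matches first, then top-scored ones."""

    def is_important(sentence):
        # Assumption: a case-insensitive substring match is enough to detect a term.
        return any(term.lower() in sentence.lower() for term in important_terms)

    # Sentences containing a dictionary term are always kept, whatever their score.
    forced = [i for i, s in enumerate(sentences) if is_important(s)]

    # Remaining sentences are ordered by their score, best first.
    ranked = sorted(
        (i for i in range(len(sentences)) if i not in forced),
        key=lambda i: scores[i],
        reverse=True,
    )

    # Fill the summary up to `size`, but never drop a forced sentence.
    chosen = (forced + ranked)[: max(size, len(forced))]

    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

With the dictionary above, both listed sentences would be forced into the summary, and any remaining slots would go to the highest-scored sentences.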

I've tried to achieve this by adding extra tags to my POS-tagged sentences. The result looks like this:

[[('Thomas A. Anderson', 'Thomas A. Anderson', ['important']), ('is', 'is', ['VBZ']), ('a', 'a', ['DT']), ('man', 'man', ['NN']), ('living', 'living', ['VBG']), ('two', 'two', ['CD']), ('lives', 'lives', ['NNS'])]]

[[('He', 'He', ['PRP']), ('must', 'must', ['MD']), ('confront', 'confront', ['VB']), ('the', 'the', ['DT']), ('agents', 'agents', ['NNS']), (':', ':', [':']), ('super-powerful', 'super-powerful', ['important', 'JJ']), ('computer', 'computer', ['NN']), ('programs', 'programs', ['NNS']), ('devoted', 'devoted', ['VBD']), ('to', 'to', ['TO']), ('snuffing', 'snuffing', ['VBG']), ('out', 'out', ['RP']), ('Neo', 'Neo', ['NNP']), ('and', 'and', ['CC']), ('the', 'the', ['DT']), ('entire', 'entire', ['JJ']), ('human', 'human', ['JJ']), ('rebellion', 'rebellion', ['NN']), ('.', '.', ['.'])]]

But I don't know how to tell the TextRank algorithm to give priority to sentences with those tags. I used Python with nltk and yaml to produce this output.
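One possible direction, sketched here as an assumption rather than a known solution, is to bias the random-jump term of the ranking (a "personalized" PageRank) so that sentences containing a dictionary term receive more weight. The sketch below is pure Python and does not use the tagged structure above; `biased_textrank`, `boost`, and the substring match are hypothetical, and the similarity function is the word-overlap measure commonly used with TextRank. Note that this only raises the score of tagged sentences; it does not guarantee they end up in the summary.

```python
import re
from itertools import combinations
from math import log


def tokenize(sentence):
    # Lowercased word set; hyphenated words such as "super-powerful" stay whole.
    return set(re.findall(r"[a-z0-9'-]+", sentence.lower()))


def similarity(a, b):
    # Word overlap normalized by the log of the sentence lengths.
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(wa & wb) / (log(len(wa)) + log(len(wb)))


def biased_textrank(sentences, important_terms, boost=5.0, damping=0.85, iterations=50):
    """Return one score per sentence; dictionary matches get extra jump weight."""
    n = len(sentences)

    # Symmetric sentence-similarity graph.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]

    # Assumption: a sentence is "important" if it contains a dictionary term.
    bias = [
        boost if any(t.lower() in s.lower() for t in important_terms) else 1.0
        for s in sentences
    ]
    total = sum(bias)
    bias = [b / total for b in bias]  # normalize to a probability vector

    # Power iteration with the biased jump vector instead of a uniform one.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) * bias[i]
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

Calling `biased_textrank(sentences, ["super-powerful", "Thomas A. Anderson"])` would return scores that can then be sorted to pick the summary sentences; a larger `boost` pushes tagged sentences further up.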

Help would be greatly appreciated!

  • Read this SO Q&A, maybe you can use this: http://stackoverflow.com/questions/42269313/interpreting-the-sum-of-tf-idf-scores-of-words-across-documents – stovfl Apr 13 '17 at 13:06
  • What TextRank algorithm do you use? Is it publicly available? – amirouche Jul 19 '17 at 07:59
