4

I run a website that lets users write blog posts. I would really like to summarize the written content automatically and use the summary to fill the <meta name="description" .../> tag, for example.

What methods can I employ to automatically summarize/describe the contents of user generated content?
Are there any (preferably free) methods out there that have solved this problem?

(I've seen other websites just copy the first 100 or so words but this strikes me as a sub-optimal solution.)

theycallmemorty
Jacco

10 Answers

5

Think of the task of summarization as a challenge to 'select the most important sentences' from the document.

The Automatic Creation of Literature Abstracts by H.P. Luhn (1958) describes a naive method that actually performs quite well: it scores each sentence by how densely it contains the document's frequently occurring significant words. Try giving it a shot.

If your website is written in Python, coding this algorithm with NLTK (the Natural Language Toolkit) is a fun task.
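To give a feel for the idea, here is a minimal sketch of Luhn-style sentence scoring using only the standard library. The stop-word list, the naive sentence splitter, and the `min_freq` and `max_gap` values are my own assumptions for illustration, not values taken from the paper; NLTK's tokenizers and stop-word lists would be a sturdier replacement.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "on",
              "is", "are", "was", "it", "this", "that", "for", "with", "as",
              "by", "be", "at"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def luhn_score(words, significant, max_gap=4):
    """Best cluster score in a sentence: (significant words)^2 / cluster length.

    A cluster is a run of words that starts and ends with a significant word
    and has at most max_gap insignificant words between neighbours.
    """
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1
        else:
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count * count / (prev - start + 1))


def summarize(text, num_sentences=1, min_freq=2):
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s)
                   if w not in STOP_WORDS)
    # "Significant" here means: non-stop-word seen at least min_freq times.
    significant = {w for w, c in freq.items() if c >= min_freq}
    scored = [(luhn_score(tokenize(s), significant), i)
              for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:num_sentences]
    # Return the chosen sentences in their original order.
    return " ".join(sentences[i] for _, i in sorted(top, key=lambda t: t[1]))
```

Calling `summarize(post_text, num_sentences=2)` returns the two highest-scoring sentences in document order, which you could then trim to fit the description tag.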

theycallmemorty
4

Make it predictable.

From a user's perspective, simply using the first paragraph is not bad at all. Any automation is bound to fall flat in some cases. So I suggest displaying the first paragraph (maybe truncated at some point) as the summary and offering an optional field to override it.
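A rough sketch of that fallback logic, in Python for illustration. The function names, the `override` parameter, and the 155-character limit are all assumptions of mine; pick whatever limit suits your pages.

```python
import html


def first_paragraph(text):
    # A paragraph is taken to be the text before the first blank line
    # (assumes "\n" line endings).
    for block in text.split("\n\n"):
        if block.strip():
            return " ".join(block.split())  # collapse internal whitespace
    return ""


def truncate(text, max_chars=155):
    if len(text) <= max_chars:
        return text
    # Cut at the last space before the limit so no word is split in half;
    # reserve three characters for the trailing dots.
    cut = text[:max_chars - 3].rsplit(" ", 1)[0]
    return cut.rstrip(" ,;:.") + "..."


def meta_description(post_text, override=None, max_chars=155):
    # The author's optional summary wins; otherwise use the first paragraph.
    if override and override.strip():
        summary = override.strip()
    else:
        summary = first_paragraph(post_text)
    content = html.escape(truncate(summary, max_chars), quote=True)
    return '<meta name="description" content="%s">' % content
```

Escaping with `html.escape(..., quote=True)` matters here because the text ends up inside an attribute value, and user-written posts can contain quotes or angle brackets.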

phoku
1

I might try using Amazon Mechanical Turk or one of the many other crowdsourcing options.

Mark P Neyer
1

Another item to check out is a SourceForge project, the AutoSummary Semantic Analysis Engine.

David Boike
1

Not a trivial task... You should look for articles or books on "extractive summarization".

A few starters could be:

Books:

Articles:

cfischer
  • The "how to identify the gist of a text" paper also has software available: http://www.icmc.usp.br/~taspardo/GistSumm.htm – Nate Kohl Oct 06 '09 at 12:47
  • Also, the MEAD project (http://www.summarization.com/mead/) by some folks at the University of Michigan looks like it has software available, although the link is down right now. – Nate Kohl Oct 06 '09 at 12:59
  • Other links are dead, so the "how to identify the gist of a text" paper can now be found here: http://www.icmc.usp.br/~taspardo/I2TS2002-PardoEtAl.pdf – HappyTimeGopher May 04 '12 at 17:26
1

Yahoo has a free API for this: http://developer.yahoo.com/search/content/V1/termExtraction.html

Eugene Osovetsky
  • This service extracts keywords from a given string. Nice, but not answering the question. – Jacco Oct 07 '09 at 11:01
1

Apple's patent 6424362 - Auto-summary of document content contains sample code which might be useful...

Stobor
0

This borders on artificial intelligence so there's not going to be an "easy" solution out there, but there are products that target this problem.

Check out Copernic Summarizer, for one.

David Boike
0

Noun phrases tend to be the important elements of a sentence. Picking the sentence(s) with a high density of noun phrases could yield a good summary. You could get the noun phrases using a POS tagger.

A good summary should also be a complete, meaningful sentence. Reading a broken sentence is slightly jarring.
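As a simplified sketch of this idea, the snippet below ranks sentences by the share of noun tokens rather than by full noun phrases, which is a cruder proxy that I am swapping in to keep the example short. It assumes each sentence has already been tagged as a list of `(word, tag)` pairs with Penn Treebank tags, the shape that `nltk.pos_tag` returns; the function names and the `min_tokens` cutoff are hypothetical.

```python
# Penn Treebank tags for common and proper nouns, singular and plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_density(tagged_sentence):
    """Fraction of word tokens (punctuation ignored) tagged as nouns."""
    words = [(w, t) for w, t in tagged_sentence
             if any(ch.isalnum() for ch in w)]
    if not words:
        return 0.0
    nouns = sum(1 for _, tag in words if tag in NOUN_TAGS)
    return nouns / len(words)


def best_sentence_index(tagged_sentences, min_tokens=4):
    """Index of the sentence with the highest noun density, or None.

    Very short sentences are skipped so that a fragment such as a bare
    title does not win on density alone.
    """
    candidates = [i for i, s in enumerate(tagged_sentences)
                  if len(s) >= min_tokens]
    if not candidates:
        return None
    return max(candidates, key=lambda i: noun_density(tagged_sentences[i]))
```

Returning an index rather than rebuilt text lets you show the original, untouched sentence, so the summary stays a whole sentence.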

Shashikant Kore
0

Alternatively, when posting the article, the author can mark the keywords that should appear in the description, and these can then be placed in the meta description tag automatically.

vikramjb
  • I've been thinking about this option, but I would like to keep the system as easy as possible for the user, so it is not possible here. (It is great for paid contributions and the like, but not for my audience.) – Jacco Oct 07 '09 at 09:41