Getting a "summary" of a webpage

Question

I have something of a hairy problem: I'd like to generate a couple of paragraphs of "description" for a given URL, normally the start of an article. The meta description field is one way to go, but it isn't always good or set properly.

It's fair to say this is a bit problematic to accomplish from the screen-scraped HTML. My general idea was to scan the HTML for the first "appropriate" segment, but it's hard to say what that is; perhaps something like the first paragraph containing a certain amount of text...

Anyone have any good ideas? :) It doesn't have to be foolproof.

score 6 · Accepted Answer · edited Nov 27 '17 at 13:24


So, you wanna become a new Google, heh? :-)

Many sites are "SEO friendly" these days. This lets you go for the headings and then look at the paragraphs below them.

Also, look for lists. A lot of content lives in tab-like interfaces (tabs, accordions...) that are built using ordered or unordered lists.

If that fails, maybe look for a div with class "content" or "main" or a combination and start from there.
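As a rough illustration of the "headings first, then paragraphs" idea, here is a minimal sketch in Python using only the standard library `html.parser` (not the C# / HtmlAgilityPack setup mentioned in this answer). The class name, the `summarize` helper, and the `min_chars` threshold are all made up for the example; the threshold is just one guess at what "enough text" might mean.

```python
from html.parser import HTMLParser


class HeadingThenParagraphs(HTMLParser):
    """Collect the text of <p> elements that appear after the first <h1>/<h2>."""

    def __init__(self, max_paragraphs=2, min_chars=80):
        super().__init__(convert_charrefs=True)
        self.max_paragraphs = max_paragraphs
        self.min_chars = min_chars  # hypothetical cut-off for "enough text"
        self.seen_heading = False
        self.buffer = None          # text chunks of the open <p>, or None
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.seen_heading = True
        elif tag == "p":
            self.flush()            # an unclosed <p> ends where the next one starts
            if self.seen_heading:
                self.buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def flush(self):
        """Store the open paragraph if it is long enough and there is room left."""
        if self.buffer is None:
            return
        text = " ".join("".join(self.buffer).split())  # collapse whitespace
        self.buffer = None
        if len(text) >= self.min_chars and len(self.paragraphs) < self.max_paragraphs:
            self.paragraphs.append(text)


def summarize(html, max_paragraphs=2, min_chars=80):
    """Return up to max_paragraphs paragraph texts found after the first heading."""
    parser = HeadingThenParagraphs(max_paragraphs, min_chars)
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a paragraph left open at end of input
    return parser.paragraphs
```

Paragraphs that come before the first heading (navigation, cookie banners and so on) are ignored, and short paragraphs are skipped. On pages with no `h1`/`h2` at all this returns an empty list, which is where the next fallback would kick in.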

If you use different approaches, make sure you keep statistics of what worked and what didn't (maybe even save a full page), so you can review and tweak your parsing and searching methods.
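One way the bookkeeping could look, sketched in Python with the standard library only. `StrategyRunner` and the two placeholder strategies are invented for this example; real strategies would be the heading, list, and content-div heuristics described above.

```python
from collections import Counter


class StrategyRunner:
    """Try extraction strategies in order and count which one succeeded."""

    def __init__(self, strategies):
        self.strategies = strategies  # list of (name, callable(html) -> str)
        self.stats = Counter()        # how often each strategy produced text
        self.failed_pages = []        # full pages nothing handled, kept for review

    def extract(self, html):
        for name, strategy in self.strategies:
            text = strategy(html)
            if text:
                self.stats[name] += 1
                return text
        self.stats["none"] += 1
        self.failed_pages.append(html)
        return ""


# Placeholder strategies, only here to make the sketch runnable.
def by_heading(html):
    return "Intro text" if "<h1>" in html else ""


def by_content_div(html):
    return "Body text" if 'class="content"' in html else ""


runner = StrategyRunner([("heading", by_heading), ("content_div", by_content_div)])
```

After running a batch of pages through `runner.extract(...)`, `runner.stats` shows which heuristics actually earn their keep, and `runner.failed_pages` holds the pages worth looking at by hand.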

As a side note, I've used HtmlAgilityPack to parse and search through HTML with success. Well, at least it beats parsing with regex :-)

edited Nov 27 '17 at 13:24

carla


answered May 31 '10 at 05:41

Luc


You heard it here first ;) But actually I think Google has it easier, since they probably just look at the phrase/word in the search text and take the surrounding text. Looking for the first h1 or h2 and parsing from there is a great idea, thanks. As you said, one probably has to use a hybrid of different techniques to cover different scenarios, but that's a nice start. I'll probably start out by replacing some tags with regex, running it through something to make it well-formed, and using XML DOM/XPath stuff from there – Homde May 31 '10 at 07:09
I would *strongly* recommend against using regex. As I said, I've used HtmlAgilityPack. It uses XPath to traverse the HTML document, which, imo, is much cleaner. Also, see http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak – Luc May 31 '10 at 07:47
Just regex for stripping some tags; I agree navigating HTML with regex is insane :) Navigating an XML tree, though, is very simple and efficient – Homde May 31 '10 at 08:09

score 2 · Answer 2 · answered May 31 '10 at 05:13


Perhaps look for the div element that contains the most p elements, and then grab its first p child. If there is no div, get the first p from the body element.

This will always have its problems.
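For what it's worth, a minimal sketch of that heuristic in Python with the standard library `html.parser`. The names are invented for the example, and it assumes "contains" means the nearest enclosing div, so a `p` inside a nested div counts only for the inner one.

```python
from html.parser import HTMLParser


class DivParagraphCounter(HTMLParser):
    """Group paragraph texts by their nearest enclosing <div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_stack = []         # indices of the currently open <div> elements
        self.div_paragraphs = []    # div_paragraphs[i] = paragraph texts of div i
        self.loose_paragraphs = []  # paragraphs that are not inside any <div>
        self.current = None         # text chunks of the open <p>, or None
        self.target = None          # list the open <p> will be appended to

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.close_paragraph()
            self.div_paragraphs.append([])
            self.div_stack.append(len(self.div_paragraphs) - 1)
        elif tag == "p":
            self.close_paragraph()
            self.current = []
            if self.div_stack:
                self.target = self.div_paragraphs[self.div_stack[-1]]
            else:
                self.target = self.loose_paragraphs

    def handle_endtag(self, tag):
        if tag == "p":
            self.close_paragraph()
        elif tag == "div":
            self.close_paragraph()
            if self.div_stack:
                self.div_stack.pop()

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def close_paragraph(self):
        """Finish the open paragraph, if any, and store its non-empty text."""
        if self.current is None:
            return
        text = " ".join("".join(self.current).split())  # collapse whitespace
        if text:
            self.target.append(text)
        self.current = None


def first_paragraph_of_densest_div(html):
    """First <p> of the div with the most paragraphs, else the first loose <p>."""
    parser = DivParagraphCounter()
    parser.feed(html)
    parser.close()
    parser.close_paragraph()  # pick up a paragraph left open at end of input
    best = max(parser.div_paragraphs, key=len, default=[])
    if best:
        return best[0]
    return parser.loose_paragraphs[0] if parser.loose_paragraphs else ""
```

A comment section or a long footer can easily out-count the article body, which is one of the ways this goes wrong.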

answered May 31 '10 at 05:13

alex


score 0 · Answer 3 · answered May 31 '10 at 05:56


You can strip the HTML tags using this regular expression:

string stripped = Regex.Replace(textBox1.Text, @"<(.|\n)*?>", string.Empty);

You will then get the content text you can use to generate your paragraphs.
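To see what that pattern leaves behind, here is the same regex tried in Python (a sketch, not the C# call itself; `strip_tags` and the sample page are made up for the example):

```python
import re

# Same pattern as the C# call above: a "<", then as little as possible
# (including newlines), then a ">".
TAG_PATTERN = re.compile(r"<(.|\n)*?>")


def strip_tags(html):
    """Remove everything that looks like a tag; all text between tags is kept."""
    return TAG_PATTERN.sub("", html)


sample = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><a href='/'>Home</a><p>Hello</p></body></html>"
)

print(strip_tags(sample))  # prints: Tp{color:red}HomeHello
```

Only the tags themselves are removed, so the title, the stylesheet contents, and the link text end up run together with the paragraph text. Some filtering of `script`, `style`, and navigation elements would be needed before the result is usable as a summary.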

answered May 31 '10 at 05:56

SiN


Ew... I don't think this will work very well at all! You'll wind up with a bunch of gibberish: a bunch of headers and links mashed together into nonsense. – mpen May 31 '10 at 06:08
