2

I have something of a hairy problem: I'd like to generate a couple of paragraphs of "description" for a given URL, normally the start of an article. The meta description field is one way to go, but it isn't always good or set properly.

It's fair to say it's a bit problematic to accomplish this from the screen-scraped HTML. My general idea is to scan the HTML for the first "appropriate" segment, but it's hard to say what that is; perhaps something like the first paragraph containing a certain amount of text...
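A minimal sketch of the "first paragraph containing a certain amount of text" idea, using only Python's standard `html.parser`. The names `ParagraphCollector` and `first_long_paragraph` and the 80-character threshold are made up for illustration, not a tested rule.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the visible text of every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._buf = None      # text pieces of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()     # an unclosed <p> ends where the next one starts
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()         # keep a trailing paragraph that was never closed

    def _flush(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.paragraphs.append(text)
            self._buf = None


def first_long_paragraph(html, min_chars=80):
    """Return the first paragraph with at least min_chars characters, or None."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    for text in parser.paragraphs:
        if len(text) >= min_chars:
            return text
    return None
```

Short teaser or navigation paragraphs are skipped simply because they fall under the threshold, which would likely need tuning per site.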

Does anyone have any good ideas? :) It doesn't have to be foolproof.

Homde

3 Answers

6

So, you wanna become a new Google, heh? :-)

Many sites are "SEO friendly" these days, which usually means they use proper headings. This lets you find the headings and then look for the paragraphs below them.
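The heading-first idea could be sketched roughly like this with Python's standard `html.parser`. The class and function names, the choice of `h1`/`h2`, and the two-paragraph limit are all assumptions for illustration.

```python
from html.parser import HTMLParser


class AfterHeading(HTMLParser):
    """Finds the first h1/h2 and the first few paragraphs that follow it."""

    HEADINGS = {"h1", "h2"}

    def __init__(self, max_paragraphs=2):
        super().__init__()
        self.max_paragraphs = max_paragraphs
        self.heading = None    # text of the first heading, once it is closed
        self.paragraphs = []   # paragraphs found after that heading
        self._target = None    # "heading", "p", or None while skipping
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self.heading is None:
            if tag in self.HEADINGS and self._target is None:
                self._target, self._buf = "heading", []
        elif tag == "p":
            self.end_paragraph()  # an unclosed <p> ends where the next starts
            if len(self.paragraphs) < self.max_paragraphs:
                self._target, self._buf = "p", []

    def handle_endtag(self, tag):
        if self._target == "heading" and tag in self.HEADINGS:
            self.heading = self._text()
            self._target = None
        elif tag == "p":
            self.end_paragraph()

    def handle_data(self, data):
        if self._target is not None:
            self._buf.append(data)

    def end_paragraph(self):
        if self._target == "p":
            text = self._text()
            if text:
                self.paragraphs.append(text)
            self._target = None

    def _text(self):
        return " ".join("".join(self._buf).split())  # collapse whitespace


def describe_from_heading(html, max_paragraphs=2):
    """Return (heading, paragraphs); heading is None if no h1/h2 was found."""
    parser = AfterHeading(max_paragraphs)
    parser.feed(html)
    parser.close()
    parser.end_paragraph()  # keep a trailing paragraph that was never closed
    return parser.heading, parser.paragraphs
```

Paragraphs that appear before the first heading (navigation blurbs, banners) are ignored, which is the main point of starting from the heading.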

Also, look for lists. A lot of content lives in tab-like interfaces (tabs, accordions...) that are built using ordered or unordered lists.

If that fails, maybe look for a div with class "content" or "main" or a combination and start from there.
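As a rough sketch of the container-class fallback, the following collects paragraphs found inside any div whose class list includes "content" or "main". It uses Python's standard `html.parser`; the class set and all names are assumptions, and real sites use many other conventions.

```python
from html.parser import HTMLParser

CANDIDATE_CLASSES = {"content", "main"}  # assumed class names, adjust per site


class ContentDivParagraphs(HTMLParser):
    """Collects <p> text found inside a div with a candidate class."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0    # div nesting depth inside a matched div, 0 = outside
        self._buf = None   # text pieces of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._depth:
                self._depth += 1  # nested div inside the matched container
            else:
                classes = set((dict(attrs).get("class") or "").split())
                if CANDIDATE_CLASSES & classes:
                    self._depth = 1
        elif tag == "p" and self._depth:
            self._flush()
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._flush()
            self._depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.paragraphs.append(text)
            self._buf = None


def paragraphs_in_content_div(html):
    """Return the paragraphs inside divs classed 'content' or 'main'."""
    parser = ContentDivParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

An empty result is the signal to fall through to another heuristic.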

If you use different approaches, make sure you keep statistics of what worked and what didn't (maybe even save a full page), so you can review and tweak your parsing and searching methods.
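One way to keep those statistics is to run the approaches as an ordered fallback chain and count which one produced the result. This is only a sketch: `describe` and the strategy names are hypothetical, and each strategy is assumed to be a function that takes the HTML and returns text or `None`.

```python
from collections import Counter


def describe(html, strategies, stats):
    """Try each (name, function) pair in order and record which one worked."""
    for name, strategy in strategies:
        result = strategy(html)
        if result:
            stats[name] += 1
            return result
    stats["no_match"] += 1  # worth saving these pages for later review
    return None


# Usage with placeholder strategies; real ones would parse the HTML properly.
stats = Counter()
strategies = [
    ("heading", lambda html: None),
    ("content_div", lambda html: "found" if "content" in html else None),
]
describe('<div class="content">x</div>', strategies, stats)
describe("<p>x</p>", strategies, stats)
```

Reviewing `stats` over a large crawl shows which heuristics earn their place in the chain and how often nothing matches.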

As a side note, I've used htmlagilitypack to parse and search through HTML with success. Well, at least it beats parsing with regex :-)

carla
Luc
  • You heard it here first ;) But actually I think Google has it easier, since they probably just look for the search phrase and take the surrounding text. Looking for the first h1 or h2 and parsing from there is a great idea, thanks. As you said, one probably has to use a hybrid of different techniques to cover different scenarios, but that's a nice start. I'll probably start out by stripping some tags with regex, running the result through something to make it well-formed, and using XML DOM/XPath from there – Homde May 31 '10 at 07:09
  • I would *strongly* recommend against using regex. As I said, I've used htmlagilitypack. It uses XPath to traverse the HTML document, which, imo, is much cleaner. Also, see http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak – Luc May 31 '10 at 07:47
  • Just regex for stripping some tags; I agree navigating HTML with regex is insane :) Navigating an XML tree, though, is very simple and efficient – Homde May 31 '10 at 08:09
2

Perhaps look for the div element that contains the most p elements, and then grab its first p child. If there is no div, get the first p from the body element.
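A minimal sketch of that heuristic using Python's standard `html.parser`. All names are illustrative, and one simplification is assumed: each paragraph is counted toward its nearest enclosing div rather than strictly as a direct child.

```python
from html.parser import HTMLParser


class DivParagraphCounter(HTMLParser):
    """Counts <p> elements per div and remembers each div's first paragraph."""

    def __init__(self):
        super().__init__()
        self.divs = []        # [p_count, first_p_text] per div, document order
        self.first_p = None   # first paragraph anywhere, used as the fallback
        self._open = []       # stack of indexes into self.divs for open divs
        self._buf = None      # text pieces of the <p> being read, or None
        self._owner = None    # index of the div owning the <p> being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.flush()      # a <div> start implicitly closes an open <p>
            self.divs.append([0, None])
            self._open.append(len(self.divs) - 1)
        elif tag == "p":
            self.flush()
            self._buf = []
            self._owner = self._open[-1] if self._open else None

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
        elif tag == "div":
            self.flush()
            if self._open:
                self._open.pop()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def flush(self):
        if self._buf is None:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        self._buf = None
        if not text:
            return
        if self.first_p is None:
            self.first_p = text
        if self._owner is not None:
            record = self.divs[self._owner]
            record[0] += 1
            if record[1] is None:
                record[1] = text


def first_paragraph_of_densest_div(html):
    """First <p> of the div holding the most paragraphs, else the first <p>."""
    parser = DivParagraphCounter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep a trailing paragraph that was never closed
    best = max(parser.divs, key=lambda record: record[0], default=None)
    if best is not None and best[0] > 0:
        return best[1]
    return parser.first_p
```

When two divs tie on paragraph count, the one that appears first in the document wins.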

This will always have its problems.

alex
0

You can strip the HTML tags using this regular expression:

// Requires: using System.Text.RegularExpressions;
string stripped = Regex.Replace(textBox1.Text, @"<(.|\n)*?>", string.Empty);

You will then get the content text, which you can use to generate your paragraphs.

SiN
  • Ew... I don't think this will work very well at all! You'll wind up with a bunch of gibberish: headers and links mashed together into nonsense. – mpen May 31 '10 at 06:08