
The big mission: I am trying to get a few lines of summary of a webpage. i.e. I want to have a function that takes a URL and returns the most informative paragraph from that page. (Which would usually be the first paragraph of actual content text, in contrast to "junk text", like the navigation bar.)

So I managed to reduce an HTML page to a bunch of text by cutting out the tags, throwing out the <HEAD> and all the scripts. But some of the text is still "junk text". I want to know where the actual paragraphs of text begin. (Ideally it should be human-language-agnostic, but if you have a solution only for English, that might help too.)

How can I figure out which of the text is "junk text" and which is actual content?

UPDATE: I see some people have pointed me to use an HTML parsing library. I am using Beautiful Soup. My problem isn't parsing HTML; I already got rid of all the HTML tags, I just have a bunch of text and I want to separate the context text from the junk text.

Ram Rachum
    Can you post a sample of the text you have? And what you want it to become? Regarding parsing HTML with regex - obligatory link - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Oded Jul 24 '10 at 16:18
  • Here's a sample of text from a web page: http://cool-rr.com/sample_text.delete_me.txt It happens to be a page from Python's documentation. – Ram Rachum Jul 24 '10 at 16:41
  • I removed the `[regex]` tag because it seems to be tricking people into thinking that you're trying to use regular expressions to extract the text from the page, but that's not what your question is about at all. This is really a text processing question. It barely has anything to do with HTML at all; the fact that the text was extracted from a web page doesn't matter much, except to the extent that you want to try using the HTML markup to help you identify the important pieces of text. – David Z Jul 24 '10 at 18:39

4 Answers


A general solution to this problem is non-trivial.

To put this in context, a large part of Google's success with search has come from their ability to automatically discern some semantic meaning from arbitrary Web pages, namely figuring out where the "content" is.

One idea that springs to mind: if you can crawl many pages from the same site, you will be able to identify patterns. Menu markup will be largely the same across all pages. If you zero this out somehow (and the matching will need to be fairly "fuzzy"), what's left is the content.
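A rough sketch of that cross-page idea in Python (the function name and the 60% threshold are my own, purely illustrative):

```python
from collections import Counter

def strip_common_lines(pages, threshold=0.6):
    """Drop lines that recur on a large fraction of pages (likely menus,
    footers, and other boilerplate), keeping page-specific text.

    `pages` is a list of page texts (tags already removed); `threshold` is
    the fraction of pages a line must appear on to count as boilerplate.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        for line in set(l.strip() for l in page.splitlines() if l.strip()):
            counts[line] += 1
    cutoff = threshold * len(pages)
    cleaned = []
    for page in pages:
        kept = [l.strip() for l in page.splitlines()
                if l.strip() and counts[l.strip()] <= cutoff]
        cleaned.append("\n".join(kept))
    return cleaned
```

A real version would need the "fuzzy" matching mentioned above (near-duplicate lines, dates in footers, etc.); exact line matching is only the simplest starting point.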

The next step would be to identify the text and what constitutes a boundary. Ideally that would be some HTML paragraphs but you won't get that lucky most of the time.

A better approach might be to find the site's RSS feeds and get the content that way, since the feed entries will already be stripped down to just the article text. Ignore any AdSense (or similar) content and you should be left with the text you want.
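For the RSS route, here is a minimal stdlib-only sketch of pulling the text out of a feed's items (a real feed library such as feedparser would cope with far more feed dialects than this does):

```python
import re
import xml.etree.ElementTree as ET

def rss_item_texts(rss_xml):
    """Extract plain text from each <item>'s <description> in an RSS 2.0 feed.

    Descriptions are often HTML, so any remaining tags are crudely removed.
    """
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        desc = item.findtext("description") or ""
        texts.append(re.sub(r"<[^>]+>", "", desc).strip())
    return texts
```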

Oh, and throw out your regex code for this. It requires an HTML parser, absolutely without question.

cletus
  • Cletus, the HTML is a non-issue. The tags don't interest me, I throw all of them out. The reason I'm thinking about regex is to use it for telling which pieces of text are flowing paragraphs and which are link texts from the navigation bar (or other small bits of text.) – Ram Rachum Jul 24 '10 at 16:47

You could use the approach outlined at the AI Depot blog, along with some Python code.
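The gist of that approach is to score text by how "dense" it is and keep the dense runs. A heavily simplified version for text that already has its tags stripped (this is my own reduction of the idea, not the post's actual code, and the word-count threshold is arbitrary):

```python
def dense_blocks(lines, min_words=10):
    """Treat consecutive word-rich lines as content blocks; short fragments
    (menu labels, link text, headings) are assumed to be junk."""
    blocks, current = [], []
    for line in lines:
        if len(line.split()) >= min_words:
            current.append(line)
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks
```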

ars

Probably a bit overkill, but you could try nltk, the Natural Language Toolkit. That library is used for parsing natural language. It's quite a nice library and an interesting subject. If you just want to split a text into sentences, you would do something like:

>>> import nltk
>>> nltk.sent_tokenize("Hi this is a sentence. And isn't this a second one, a sentence with a url http://www.google.com in it?")
['Hi this is a sentence.', "And isn't this a second one, a sentence with a url http://www.google.com in it?"]

Or you could use the sentences_from_text method of the PunktSentenceTokenizer class. You have to run nltk.download() first to fetch the tokenizer data.

SiggyF

I'd recommend having a look at what Readability does. Readability strips out everything but the actual content of the page and restyles it for easy reading. In my experience it works very well at detecting the content.

Have a look at its source code (particularly the grabArticle function) and maybe you can get some ideas.
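To give a taste of the kind of scoring grabArticle does: it rewards longer, comma-rich paragraphs and penalizes link-heavy ones. A toy imitation (the weights below are made up for illustration, not Readability's actual values):

```python
def score_paragraph(text):
    """Score a paragraph: commas and length suggest flowing prose rather
    than navigation fragments. Weights are illustrative only."""
    score = 1
    score += text.count(",")
    score += min(len(text) // 100, 3)  # up to 3 points for sheer length
    return score

def best_paragraph(paragraphs):
    """Return the highest-scoring paragraph."""
    return max(paragraphs, key=score_paragraph)
```

Applied to a list of the text chunks you've already extracted, this tends to pick the first real paragraph over navigation-bar fragments, which is close to what the question is after.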

Liquid_Fire