How to extract meaningful text from HTML

Question

I would like to parse an HTML page and extract the meaningful text from it. Does anyone know of good algorithms for this?

I develop my applications in Rails, but I think Ruby is a bit slow for this, so a good C library would be appropriate if one exists.

Thanks!!

PS: Please do not recommend anything in Java.

UPDATE: I found this link text

Sadly, it is in Python.

Requiring that the text be *meaningful* makes this a much more difficult task. — Rob Kennedy, Oct 19 '10 at 15:22
Yes, but apparently the "statistical" approach is the right answer — Nisanio, Oct 19 '10 at 15:36

score 6 · Accepted Answer · edited May 23 '17 at 12:09

Use Nokogiri, which is fast and written in C, for Ruby.

(Using regexp to parse recursive expressions like HTML is notoriously difficult and error-prone, and I would not go down that path. I only mention this in the answer because the issue seems to crop up again and again.)

With a real parser such as Nokogiri, you also get the added benefit that the structure and logic of the HTML document are preserved, and sometimes you really need those clues.

score 2 · Answer 2 · edited May 23 '17 at 12:33

Solutions integrating with Ruby

Use Nokogiri, as recommended by Amigable Clark Kant.
Use Hpricot.

External Solutions

If your HTML is well-formed, you could use the Expat XML Parser for this.
For something targeted at HTML only, the W3C released the code for LibWWW, which contains a simple HTML parser (documentation).

answered Oct 19 '10 at 14:45

haylem

score -1 · Answer 3 · answered Oct 19 '10 at 14:36

Lynx can do this. It is open source, if you want to take a look at it.

mouviciel

But spawning a separate program is not my idea of fast. – Prof. Falken Oct 19 '10 at 14:46
Yes, you are right. The website will crawl several pages and extract their text. The idea is to separate the text of the news story from the rest of the text. It must be very fast. – Nisanio Oct 19 '10 at 14:56
I don't suggest using Lynx as is. You can take whatever is of interest to you from the source code and compile it as a library. – mouviciel Oct 19 '10 at 14:59

score -3 · Answer 4 · edited Oct 21 '10 at 13:32

You should strip every angle-bracketed part from the text and then collapse the whitespace. In theory, < and > should not appear anywhere else, because pages use &lt; and &gt; for literal angle brackets.

Collapsing whitespace: convert every tab, newline, etc. to a space, then replace every run of spaces with a single space.

UPDATE: And you should start after finding the <body> tag.
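The steps above can be sketched in plain Ruby as follows. This is only an illustration of the strip-and-collapse idea, assuming the markup is well-formed; the helper name `strip_tags` is made up for the example.

```ruby
# Hypothetical helper illustrating strip-and-collapse; this is not a parser.
def strip_tags(html)
  # Start after the opening <body> tag; fall back to the whole input
  # when there is none.
  body = html[/<body\b[^>]*>(.*)/mi, 1] || html
  # Replace every angle-bracketed run with a space so adjacent blocks
  # do not get glued together.
  text = body.gsub(/<[^>]*>/, ' ')
  # Collapse tabs, newlines and repeated spaces into single spaces.
  text.gsub(/\s+/, ' ').strip
end
```

Note the limits of this sketch: the contents of script and style elements inside the body are kept as text, entities such as &lt; are not decoded, and a literal > inside an attribute value will confuse it.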

answered Oct 19 '10 at 14:37

Notinlist

I would not recommend using regular expressions to parse HTML or any other format like it. (Except maybe trivial cases, but as a general rule, avoid.) – Prof. Falken Oct 19 '10 at 14:46
Regex + HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Nick T Oct 19 '10 at 14:49
1st: @ Amigable Clark Kant: We are not talking about parsing, we are talking about stripping. A correct HTML can be stripped with regexp. If we have that in our specification then we can use it safely --- 2nd: You both misunderstood me. I did not recommend regexp for it. I expressed my idea about an algorithm and invoked the "regexp" phrase as a human language tool. I could write ``. – Notinlist Oct 19 '10 at 15:06
