I am doing a school project that requires extracting data from web pages. To be precise, I need a library or open-source program that extracts the human-readable content from HTML/text data, something like the text content a web browser renders.
I know that parsing HTML with regexes is the worst way to extract text from it.
Extra info:
I need it for computing similarity between text documents.
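To make the goal concrete, here is a rough sketch of the kind of pipeline I have in mind, using only Python's standard library (html.parser) rather than regexes. The names VisibleTextParser, html_to_text and cosine_similarity are placeholders I made up, and the bag-of-words cosine measure is only an assumption about how the similarity step might work. It ignores JavaScript-rendered content and CSS-hidden elements, so it is not equivalent to what a browser shows.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML string joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using raw word counts (assumed measure)."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

With this, I would call html_to_text on each downloaded page and then cosine_similarity on the resulting strings. What I am looking for is a library that does the extraction step more faithfully than this.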
Any help would be appreciated. Thanks