-1

Possible Duplicate:
Text extraction with java html parsers

I'm new to Java and am trying to program an algorithm for web page classification. I want to know how to extract text from HTML web pages in Java. It would be a great help if I could get a basic idea of what to do.

Thanks, Archana

Community
user656673
  • Also a possible duplicate of http://stackoverflow.com/questions/1386107/text-extraction-from-html-java & http://stackoverflow.com/questions/3036638/how-to-extract-web-page-textual-content-in-java – Saurabh Gokhale Mar 12 '11 at 15:13

3 Answers

0

Once you have obtained the raw HTML string, you could turn to an existing HTML parsing tool such as jsoup.

For a comparison, see: What are the pros and cons of the leading Java HTML parsers?

There is also a quick example of what you can easily extract from an HTML page using jsoup and CSS selectors: http://jsoup.org/cookbook/extracting-data/example-list-links

Community
Joey
0

I use Jericho to convert an HTML document to text. The code to get the text is pretty simple:

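    import net.htmlparser.jericho.Renderer;
    import net.htmlparser.jericho.Source;

    // html holds the raw HTML document as a String
    Source source = new Source(html);
    Renderer renderer = source.getRenderer();
    String text = renderer.toString();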

There are some options you can set on the renderer (before calling toString()) to adjust the text conversion, for example:

    renderer.setIncludeHyperlinkURLs(false);
bmargulies
Cooper
-1

@Codemwnci's answer helps you download the HTML page.

If you're looking for a way to separate HTML markup tags from content, you should use an HTML parser.
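
To give a rough idea of what "separating markup from content" looks like with a parser, here is a minimal sketch. It does not use jsoup or Jericho; it assumes only the HTML parser that ships with the JDK (javax.swing.text.html), so it runs without extra libraries. The class name HtmlTextExtractor and the method extractText are made up for this illustration, and the whitespace handling (joining text chunks with a single space) is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of any library.
final class HtmlTextExtractor {

    // Parses the HTML string and returns its text content.
    // Text chunks reported by the parser are joined with a space and
    // runs of whitespace are collapsed, so "foo<b>bar</b>" becomes "foo bar".
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText for each run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        try {
            // Third argument true: ignore any charset directive in the document,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory String.
            throw new UncheckedIOException(e);
        }

        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The JDK parser is old (it targets HTML 3.2), so for real-world pages a dedicated library such as the ones mentioned in the other answers is probably the better choice; the callback idea is the same either way.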

Mozart Brocchini
  • -1 for suggesting regular expressions to parse HTML. – Richard H Mar 14 '11 at 15:13
  • @Richard, I agree that regular expressions are probably not the best choice, but I also suggested using a parser. I actually edited my response and reordered the suggestions after your -1. The reason for suggesting regular expressions is for cases like getting the text from only a certain HTML tag. – Mozart Brocchini Mar 14 '11 at 16:01