
How can I get the description/content of a web page for a given URL (something like the short description Google shows for each result link)? I want to do this in my JSP page.

Thanks in advance!

Jarrett Widman
smartcode

1 Answer


Idea: Open the URL as a stream, then use an HTML parser to extract the content string of the page's description meta tag.

Grab URL content:

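import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Open the page as a character stream (throws IOException on failure)
URL url = new URL("http://www.url-to-be-parsed.com/page.html");
BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream()));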

You will need to adapt the above code depending on what your HTML parser library expects as input (a stream, a string, etc.).
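If the parser you pick wants the whole document as a String, one way to adapt the snippet above is to drain the reader first. This is only a minimal sketch that assumes the page is UTF-8; the class name `PageFetcher` and the methods `readAll` and `fetch` are illustrative names, not part of any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {

    // Drains a Reader into a single String (hypothetical helper).
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    // Opens the URL as a stream and returns the page source.
    // Assumption: the page is UTF-8; a real page may declare another charset.
    static String fetch(String address) throws IOException {
        URL url = new URL(address);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        }
    }
}
```

Something like `String html = PageFetcher.fetch("http://www.url-to-be-parsed.com/page.html");` would then hand the source to a parser that accepts strings.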

Parse the HTML for this tag:

<meta name="description" content="This is a place where webmasters can put a description about this web page" />

You might also be interested in grabbing the title of that page:

<title>This is the title of the page!</title>

Caution: Regular expressions do not seem to work reliably on HTML documents, so an HTML parser is the better choice.
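To make the caution concrete, here is a small sketch of how a naive regular expression can break on markup that means the same thing. The pattern and the class name `NaiveMetaRegex` are made up for illustration; they are not taken from any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveMetaRegex {

    // Naive pattern: assumes lowercase names, double quotes,
    // and the name attribute placed before content.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<meta name=\"description\" content=\"([^\"]*)\"");

    // Returns the captured description, or null if the pattern does not match.
    static String extract(String html) {
        Matcher m = DESCRIPTION.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Exactly the layout the pattern was written for: prints "A page"
        System.out.println(extract("<meta name=\"description\" content=\"A page\" />"));
        // Same meaning, attributes swapped: prints "null"
        System.out.println(extract("<meta content=\"A page\" name=\"description\" />"));
        // Single quotes and upper-case names: prints "null"
        System.out.println(extract("<META NAME='description' CONTENT='A page'>"));
    }
}
```

All three inputs describe the same meta tag, but only the first one matches, which is the kind of fragility a real HTML parser avoids.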

An example with HTML Parser:

  1. Use HasAttributeFilter to select the tags that have the attribute name="description"
  2. Cast the resulting Node to MetaTag
  3. Get the content using MetaTag.getAttribute()

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        //<meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}

Considerations:

If this is done in a JSP each time the page is loaded, you might get a slowdown due to the network I/O to the URL. Even worse, if you do this on the fly for a page of yours that contains many URL links, the slowdown could be massive because the n URLs are fetched one after another. Consider storing this information in a database and refreshing it as needed instead of fetching it on the fly in the JSPs.
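As a rough illustration of the "store and refresh as needed" idea, the sketch below keeps descriptions in memory and refetches only when an entry is missing or older than a maximum age. It uses an in-memory map as a stand-in for the database suggested above, and everything in it (`DescriptionCache`, the `fetcher` function, `maxAgeMillis`) is a hypothetical name chosen for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class DescriptionCache {

    // A cached description plus the time it was fetched.
    private static final class Entry {
        final String description;
        final long fetchedAtMillis;

        Entry(String description, long fetchedAtMillis) {
            this.description = description;
            this.fetchedAtMillis = fetchedAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    // Hypothetical hook that does the slow work (network I/O plus parsing).
    private final Function<String, String> fetcher;
    private final long maxAgeMillis;

    DescriptionCache(Function<String, String> fetcher, long maxAgeMillis) {
        this.fetcher = fetcher;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Returns the cached description, refetching only when missing or stale.
    String get(String url) {
        long now = System.currentTimeMillis();
        Entry cached = entries.get(url);
        if (cached == null || now - cached.fetchedAtMillis > maxAgeMillis) {
            cached = new Entry(fetcher.apply(url), now);
            entries.put(url, cached);
        }
        return cached.description;
    }
}
```

A JSP could then call `cache.get(url)` for each link, so repeated page loads within the maximum age would not trigger new network requests. A single shared instance (for example in application scope) would be needed for the cache to help across requests.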

Community
bakkal
  • Thank you very much for your reply. I want to extract the content information of the meta tag. I'm using HTML Parser (http://htmlparser.sourceforge.net/samples.html). Could you please help me? – smartcode Jun 30 '10 at 15:29
  • There you go. Took me a while to make my way around their API. Seems to work fine as it is. As I will be using this too, I'll update if I find more efficient ways. – bakkal Jun 30 '10 at 17:13
  • One more question: is it possible to get the value of the title as well, based on your answer? I tried it based on your answer, but still couldn't get the result. Any idea? Thanks in advance! – smartcode Jul 09 '10 at 09:16