package com.muthu;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestingTool
{
    public static void main(String[] args) throws IOException
    {
        String url = "http://www.stackoverflow.com/";
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();     // Fetch and parse the start page
        Elements links = doc.select("a[href]");      // All anchors that carry an href
        System.out.println(doc.text());

        Elements tags = doc.getElementsByTag("div"); // Currently unused
        String alls = doc.text();                    // Currently unused
        System.out.println("\n");

        for (Element link : links)
        {
            print(" %s  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }

        // Write each link's text to a file, one link per line
        BufferedWriter bw = new BufferedWriter(new FileWriter(new File("C:/tool/linknames.txt")));
        for (Element link : links)
        {
            bw.write("Link: " + link.text().trim());
            bw.write(System.getProperty("line.separator"));
        }
        bw.flush();
        bw.close();
    }

    private static void print(String msg, Object... args)
    {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width)
    {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
1 Answer
If you connect to a URL it will only parse the current page. But you can 1.) connect to a URL, 2.) parse the information you need, 3.) select all further links, 4.) connect to them and 5.) continue this as long as there are new links.
Considerations:
- You need a list (or another collection) in which you store the links you have already parsed
- You have to decide whether you only want links from this page or external ones too (a host-comparison sketch follows this list)
- You have to skip pages like "about", "contact" etc.
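For the internal-vs-external decision, one simple option is to compare hosts with java.net.URI. This is only a minimal sketch of that check, not part of the original answer; the helper name isSameDomain is made up here:

import java.net.URI;

// Hypothetical helper (not from the answer above): keep only links whose
// host matches the host of the start page.
static boolean isSameDomain(String startUrl, String linkUrl)
{
    try
    {
        String startHost = URI.create(startUrl).getHost();
        String linkHost = URI.create(linkUrl).getHost();
        return startHost != null && startHost.equalsIgnoreCase(linkHost);
    }
    catch (IllegalArgumentException e)
    {
        return false; // Malformed URL - safest to treat it as external
    }
}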
Edit:
(Note: you still have to add some error-handling code)
List<String> visitedUrls = new ArrayList<>(); // Stores all links you've already visited

public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // Now it's case-insensitive
    if (!visitedUrls.contains(url)) // Do this only if not visited yet
    {
        visitedUrls.add(url); // Remember this URL so it is never parsed twice
        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the document
        /* ... select your data here ... */
        Elements nextLinks = doc.select("a[href]"); // Select the next links - add more restrictions!
        for (Element next : nextLinks) // Iterate over all links
        {
            visitUrl(next.absUrl("href")); // Recursive call for each next link
        }
    }
}
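To start the crawl you only need one entry point. Here is a minimal, self-contained sketch of how the method above could be driven; the class name SimpleCrawler and the start URL are placeholders, not part of the answer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SimpleCrawler
{
    List<String> visitedUrls = new ArrayList<>(); // Same list as in the answer

    public static void main(String[] args) throws IOException
    {
        // Placeholder start URL - replace it with the page you want to crawl
        new SimpleCrawler().visitUrl("http://www.stackoverflow.com/");
    }

    // visitUrl(String) exactly as shown above
    public void visitUrl(String url) throws IOException
    {
        url = url.toLowerCase();
        if (!visitedUrls.contains(url))
        {
            visitedUrls.add(url);
            Document doc = Jsoup.connect(url).get();
            /* ... select your data here ... */
            Elements nextLinks = doc.select("a[href]");
            for (Element next : nextLinks)
            {
                visitUrl(next.absUrl("href"));
            }
        }
    }
}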
You have to add more restrictions / checks at the point where the next links are selected (maybe you want to skip / ignore some), plus some error handling.
Edit 2:
To skip ignored links you can do this:
- Create a Set / List / whatever in which you store the ignored keywords
- Fill it with those keywords
- Before you call the visitUrl() method with the new link to parse, check whether this new URL contains any of the ignored keywords. If it contains at least one, it will be skipped.
I modified the example a bit to do so (but it's not tested yet!).
List<String> visitedUrls = new ArrayList<>(); // Stores all links you've already visited
Set<String> ignore = new HashSet<>();         // Stores all keywords you want to ignore

// ...

/*
 * Add keywords to the ignore list. Each link that contains one of these
 * words will be skipped.
 *
 * Do this in e.g. a constructor, static block or an init method.
 */
ignore.add(".twitter.com");

// ...

public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // Now it's case-insensitive
    if (!visitedUrls.contains(url)) // Do this only if not visited yet
    {
        visitedUrls.add(url); // Remember this URL so it is never parsed twice
        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the document
        /* ... select your data here ... */
        Elements nextLinks = doc.select("a[href]"); // Select the next links - add more restrictions!
        for (Element next : nextLinks) // Iterate over all links
        {
            boolean skip = false; // If false: parse the URL; if true: skip it
            final String href = next.absUrl("href"); // The 'href' attribute -> the next link to parse
            for (String s : ignore) // Iterate over all ignored keywords - maybe there's a better solution for this
            {
                if (href.contains(s)) // If the URL contains an ignored keyword it will be skipped
                {
                    skip = true;
                    break;
                }
            }
            if (!skip)
                visitUrl(href); // Recursive call for each remaining link
        }
    }
}
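About the "maybe there's a better solution" comment on the inner loop: with Java 8 streams the keyword scan collapses to a one-liner. A behavior-equivalent sketch:

final String href = next.absUrl("href");
boolean skip = ignore.stream().anyMatch(href::contains); // True if href contains any ignored keyword
if (!skip)
    visitUrl(href);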
Parsing the next link is done by this part:
final String href = next.absUrl("href");
/* ... */
visitUrl(href);
But possibly you should add some more stop-conditions to this part.
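For example, a depth limit and a page cap are cheap stop-conditions. A sketch (untested; MAX_DEPTH, MAX_PAGES and the extra depth parameter are my additions, not part of the answer):

private static final int MAX_DEPTH = 3;   // Assumed limit: how many links deep to follow
private static final int MAX_PAGES = 500; // Assumed limit: how many pages to visit in total

public void visitUrl(String url, int depth) throws IOException
{
    if (depth > MAX_DEPTH || visitedUrls.size() >= MAX_PAGES)
        return; // Stop: the crawl went too deep or enough pages were visited
    // ... same body as above, but recurse with visitUrl(href, depth + 1) ...
}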

– ollo
- Thanks ollo. I can connect to the URL and get all the link names. But how can I connect to all the other links and parse their information... Give me some suggestions... Thanks in advance.. – Pearl Dec 09 '12 at 16:17
- Please see "edit" for a short example. Extend this for your requirements. – ollo Dec 09 '12 at 18:30
- If my post helped you, feel free to upvote it. However, does it work or do you need further help? – ollo Dec 11 '12 at 18:37
- Hi ollo, I need to know how to skip a specific URL and how we can go to the next link.... – Pearl Dec 12 '12 at 05:06
- My problem is: if my URL contains any twitter link then it concentrates on the twitter domain... It acts as a cycle. I want this loop to break and move on to the next link in the list... I tried a lot... But I'm stuck.. Help me ollo... – Pearl Dec 12 '12 at 06:49
- I posted a possible solution below "edit 2" (this code isn't tested!) – ollo Dec 12 '12 at 14:06