I am modifying the code given in [crawler4j][1]. I want to find specific links while crawling a web site. For example, while crawling www.cmu.edu I am trying to get the link for the directory search. Here is my code for it:
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        // System.out.println("URL: " + url);
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            System.out.println(html.matches(".*<a href.*."));
            if (html.matches(".*.<a href=.*.>Directory Search</a>.*."))
                System.out.println("***********Hello*********************");
            // System.out.println("----------" + html);
            return;
            // List<WebURL> links = htmlParseData.getOutgoingUrls();
        }
    }
This code does not work: I never see ***********Hello********************* on my console. As a check, I printed the html string to the console, copied the anchor tag that contains the directory search link, and wrote this simple two-line test:
    String test2 = "<li class=\"first\"><a href=\"http://directory.andrew.cmu.edu/\" title=\"Carnegie Mellon University Faculty, Staff and Student Directory\">Directory Search</a></li>";
    System.out.println("*******" + test2.matches(".*.<a href=.*.>Directory Search</a>.*."));
This works: the value of test2 is exactly the anchor tag copied from the console output, yet the same pattern fails against the full html string. What am I doing wrong in the first part of the code?
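To narrow it down outside the crawler, this self-contained sketch shows the same pattern succeeding on the one-line anchor string but failing once the input spans several lines (the multi-line wrapper here is my own made-up stand-in for the page HTML, not actual crawler4j output):

```java
import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        String regex = ".*.<a href=.*.>Directory Search</a>.*.";

        // The anchor tag exactly as copied from the console output.
        String singleLine = "<li class=\"first\"><a href=\"http://directory.andrew.cmu.edu/\" "
                + "title=\"Carnegie Mellon University Faculty, Staff and Student Directory\">"
                + "Directory Search</a></li>";

        // A made-up multi-line wrapper standing in for the full page HTML.
        String multiLine = "<html>\n<body>\n" + singleLine + "\n</body>\n</html>";

        System.out.println(singleLine.matches(regex)); // true
        System.out.println(multiLine.matches(regex));  // false: '.' does not match '\n' by default
        System.out.println(Pattern.compile(regex, Pattern.DOTALL)
                .matcher(multiLine).matches());        // true: DOTALL lets '.' match line breaks
    }
}
```

So the pattern itself seems fine on a single line; the behavior only changes when line breaks are in the input.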