9

I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem is that I was unable to crawl the content of the following site: http://www.sciencedirect.com/science/article/pii/S1568494612005741. I want to crawl the following information from that site (please take a look at the attached screenshot).

[Screenshot: the article page with three author names highlighted in red boxes]

The attached screenshot shows three author names (highlighted in red boxes). If you click one of those links, a popup appears that contains the full information about that author. I want to crawl the information shown in that popup.

I am using the following code to crawl the content.

import org.apache.http.HttpStatus;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.fetcher.PageFetchResult;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.parser.ParseData;
import edu.uci.ics.crawler4j.parser.Parser;
import edu.uci.ics.crawler4j.url.WebURL;

public class WebContentDownloader {

    private Parser parser;
    private PageFetcher pageFetcher;

    public WebContentDownloader() {
        CrawlConfig config = new CrawlConfig();
        parser = new Parser(config);
        pageFetcher = new PageFetcher(config);
    }

    private Page download(String url) {
        WebURL curURL = new WebURL();
        curURL.setURL(url);
        PageFetchResult fetchResult = null;
        try {
            // Fetch the headers first, then the content if the server returned 200 OK.
            fetchResult = pageFetcher.fetchHeader(curURL);
            if (fetchResult.getStatusCode() == HttpStatus.SC_OK) {
                try {
                    Page page = new Page(curURL);
                    fetchResult.fetchContent(page);
                    if (parser.parse(page, curURL.getURL())) {
                        return page;
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        } finally {
            if (fetchResult != null) {
                fetchResult.discardContentIfNotConsumed();
            }
        }
        return null;
    }

    private String processUrl(String url) {
        System.out.println("Processing: " + url);
        Page page = download(url);
        if (page != null) {
            ParseData parseData = page.getParseData();
            if (parseData != null) {
                if (parseData instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) parseData;
                    return htmlParseData.getHtml();
                }
            } else {
                System.out.println("Couldn't parse the content of the page.");
            }
        } else {
            System.out.println("Couldn't fetch the content of the page.");
        }
        return null;
    }

    public String getHtmlContent(String argUrl) {
        return this.processUrl(argUrl);
    }
}
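For example, I call it roughly like this (illustrative snippet only):

String html = new WebContentDownloader()
        .getHtmlContent("http://www.sciencedirect.com/science/article/pii/S1568494612005741");
System.out.println(html);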

I was able to crawl the content from the aforementioned link/site, but it doesn't contain the information I marked in the red boxes. I think those links are rendered dynamically.

  • My question is: how can I crawl the content from the aforementioned link/website?
  • How can I crawl content from Ajax/JavaScript-based websites?

Can anyone please help me with this?

Thanks & Regards, Amar


3 Answers

6

Hi, I found a workaround using another library. I used the Selenium WebDriver (org.openqa.selenium.WebDriver) library to extract the dynamic content. Here is the sample code.

import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CollectUrls {

    private WebDriver driver;

    public CollectUrls() {
        this.driver = new FirefoxDriver();
        this.driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
    }

    protected void next(String url, List<String> argUrlsList) {
        this.driver.get(url);
        // getPageSource() returns the DOM after the browser has executed the JavaScript,
        // so htmlContent also contains the dynamically rendered parts of the page.
        String htmlContent = this.driver.getPageSource();
    }
}

Here the "htmlContent" is the required one. Please let me know if you face any issues...???
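A minimal, self-contained sketch of the same idea (illustrative only; Firefox must be installed, and newer Selenium releases also need geckodriver on the PATH):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class DynamicContentDemo {

    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.sciencedirect.com/science/article/pii/S1568494612005741");
            // getPageSource() returns the DOM after JavaScript has run,
            // so it includes the dynamically rendered author links.
            String htmlContent = driver.getPageSource();
            System.out.println("Rendered HTML length: " + htmlContent.length());
        } finally {
            driver.quit();
        }
    }
}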

Thanks, Amar

  • Thanks Amar. Can you briefly explain this to me? – BasK Dec 03 '14 at 10:03
  • @Amar, I tried the same code and replaced **url** with the dynamic webpage you mentioned, ***http://www.sciencedirect.com/science/article/pii/S1568494612005741***; yet it didn't crawl the popup page, only the static page. Does your solution require any additional code? – Dinesh Kumar P Dec 09 '14 at 04:02
  • Hi Kumar, if you use crawler4j you won't see the whole HTML content (not even the static page content). For example, use crawler4j to grab the HTML content and search for those names (mentioned in the screenshot): you won't find them, because those names are rendered dynamically. But we can see those names by inspecting the element, so there is a difference between what appears in the page source and what appears when you inspect an element. By using the Selenium WebDriver we can get HTML content that matches the inspected-element content. – Amar Dec 09 '14 at 11:11
  • I will post the whole code which prints the names from that URL. – Amar Dec 09 '14 at 11:13
  • But isn't there the problem that Crawler4j itself can't detect the outgoing links because they are loaded dynamically? With this solution you only crawl one site dynamically, but crawler4j will stop because it can't find any other URLs. Is there any solution for that problem? – Fabian Lurz Nov 13 '15 at 16:52
5

Simply put, Crawler4j is a static crawler, meaning it cannot execute the JavaScript on a page. So there is no way of getting the content you want by crawling that specific page you mentioned. Of course, there are some workarounds to get it working.

If it is just this page you want to crawl, you could use a connection debugger. Check out this question for some tools. Find out which URL the AJAX request calls, and crawl that page instead.
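For instance, once the request behind the author popup has been identified with such a tool (the endpoint URL below is only a placeholder, not the real one), it can be fetched like any static page, e.g. with the WebContentDownloader class from the question:

// Placeholder URL: substitute the actual request found with the connection debugger.
String ajaxUrl = "http://www.sciencedirect.com/some/ajax/author-endpoint";
String popupHtml = new WebContentDownloader().getHtmlContent(ajaxUrl);
System.out.println(popupHtml);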

If you have various websites with dynamic content (JavaScript/AJAX), you should consider using a dynamic-content-enabled crawler like Crawljax (also written in Java).
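A rough starting point with Crawljax could look like the sketch below (API names are from Crawljax 3.x and may differ in other versions):

import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class CrawljaxDemo {

    public static void main(String[] args) throws Exception {
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://www.sciencedirect.com/science/article/pii/S1568494612005741");
        builder.setMaximumDepth(1); // stay on the article page itself
        CrawljaxRunner runner = new CrawljaxRunner(builder.build());
        runner.call(); // crawls the page, firing its JavaScript events
    }
}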

Erwin
  • Does dynamic content include Gmail? Crawljax would, theoretically, be able to handle that? – Thufir Oct 06 '14 at 11:08
  • Theoretically yes. In practice you will have to do a lot of optimizing and tweaking to get it working at a reasonable speed. If you want to scrape mails, my guess is to try looking at https://developers.google.com/gmail/ – Erwin Oct 10 '14 at 11:30
  • @pyerwin, https://github.com/crawljax/crawljax/issues/3 Is this feature really added in Crawljax? The above issue is **Closed**, not **Fixed**, so I had this doubt – Dinesh Kumar P Dec 09 '14 at 05:30
  • @Kumar the issue you mentioned was an imported issue from the old Google group, so it needed to be manually closed. (I)Frame support has been in Crawljax since version 2.0 – Erwin Dec 09 '14 at 12:50
1
I have found a solution for crawling dynamic web pages using Aperture and Selenium WebDriver.
Aperture is a crawling toolkit, and Selenium is a browser-automation/testing tool that can render what you see when you inspect an element.

1. Extract the aperture-core JAR file with a decompiler tool and create a simple web-crawling Java program. (https://svn.code.sf.net/p/aperture/code/aperture/trunk/)
2. Download the Selenium WebDriver JAR files and add them to your program.
3. Go to the CreatedDataObjec() method in org.semanticdesktop.aperture.accessor.http.HttpAccessor (in the decompiled Aperture source) and add the code below:

   WebDriver driver = new FirefoxDriver();
   String baseurl = uri.toString();
   driver.get(uri.toString());
   // getPageSource() returns the DOM after JavaScript execution,
   // i.e. the same content you see when you inspect an element.
   String str = driver.getPageSource();
   driver.close();
   stream = new ByteArrayInputStream(str.getBytes());
BasK