4

I need to crawl a site periodically to check whether its URLs are available or not. For this, I am using crawler4j.

My problem comes with some web pages that have disabled robots with <meta name="robots" content="noindex,nofollow" />, which makes sense so that these pages are not indexed by search engines, given their content.

However, crawler4j is not following these links either, even though I have disabled the RobotstxtServer in the configuration. This should be as simple as robotstxtConfig.setEnabled(false);:

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...

But the described web pages are still not explored. I have read the code, and this should be enough to disable the robots directives, but it is not working as expected. Maybe I am missing something? I have tested it with versions 3.5 and 3.6-SNAPSHOT with identical results.

King Midas

2 Answers

2

I am using a newer version:

    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>4.1</version>
    </dependency>

After setting RobotstxtConfig like this, it is working:

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);
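
For completeness, here is a minimal sketch of how this disabled RobotstxtServer can be wired into a full crawl with 4.1. The storage folder and seed URL are placeholders, and MyCrawler is just a trivial WebCrawler subclass for illustration:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class NoRobotsCrawl {

        // Minimal crawler that just logs every fetched URL.
        public static class MyCrawler extends WebCrawler {
            @Override
            public void visit(Page page) {
                System.out.println("Visited: " + page.getWebURL().getURL());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j-storage"); // placeholder folder

            PageFetcher pageFetcher = new PageFetcher(config);

            // Disable robots.txt handling entirely.
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            robotstxtConfig.setEnabled(false);
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://example.com/"); // placeholder seed
            controller.start(MyCrawler.class, 1);      // one crawler thread
        }
    }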

My test results and the crawler4j source code prove that:

public boolean allows(WebURL webURL) {
  if (config.isEnabled()) {
    try {
      URL url = new URL(webURL.getURL());
      String host = getHost(url);
      String path = url.getPath();

      HostDirectives directives = host2directivesCache.get(host);

      if ((directives != null) && directives.needsRefetch()) {
        synchronized (host2directivesCache) {
          host2directivesCache.remove(host);
          directives = null;
        }
      }

      if (directives == null) {
        directives = fetchDirectives(url);
      }

      return directives.allows(path);
    } catch (MalformedURLException e) {
      logger.error("Bad URL in Robots.txt: " + webURL.getURL(), e);
    }
  }

  return true;
}

When Enabled is set to false, it no longer performs the robots.txt check at all.
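
As a quick sanity check (reusing the pageFetcher from the setup above; the URL is a placeholder), allows() should now return true even for a path that robots.txt would normally block:

    // WebURL comes from edu.uci.ics.crawler4j.url.WebURL
    RobotstxtConfig cfg = new RobotstxtConfig();
    cfg.setEnabled(false);
    RobotstxtServer server = new RobotstxtServer(cfg, pageFetcher);

    WebURL webUrl = new WebURL();
    webUrl.setURL("http://example.com/some-disallowed-path"); // placeholder URL

    // With the check disabled, allows() short-circuits to true and never fetches robots.txt.
    System.out.println(server.allows(webUrl)); // prints: true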

josliber
sendon1982
0

Why don't you just exclude everything related to Robotstxt from crawler4j? I needed to crawl a site and ignore robots.txt, and this worked for me.

I changed CrawlController and WebCrawler in the .crawler package like this:

WebCrawler.java:

delete

private RobotstxtServer robotstxtServer;

delete

this.robotstxtServer = crawlController.getRobotstxtServer();

edit

 if ((shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
 -->
 if ((shouldVisit(webURL)))

edit

if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) && 
              (shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) && 
              (shouldVisit(webURL)))

CrawlController.java:

delete

import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

delete

 protected RobotstxtServer robotstxtServer;

edit

public CrawlController(CrawlConfig config, PageFetcher pageFetcher, RobotstxtServer robotstxtServer) throws Exception
-->
public CrawlController(CrawlConfig config, PageFetcher pageFetcher) throws Exception

delete

this.robotstxtServer = robotstxtServer;

edit

if (!this.robotstxtServer.allows(webUrl)) 
{
  logger.info("Robots.txt does not allow this seed: " + pageUrl);
} 
else 
{
  this.frontier.schedule(webUrl);
}
-->
this.frontier.schedule(webUrl);

delete

public RobotstxtServer getRobotstxtServer()
{
  return this.robotstxtServer;
}
public void setRobotstxtServer(RobotstxtServer robotstxtServer)
{
  this.robotstxtServer = robotstxtServer;
}
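
With those edits applied, the controller can be constructed without a RobotstxtServer. A rough sketch against the modified sources above (the storage folder and seed are placeholders, and MyCrawler stands for your own WebCrawler subclass):

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawl");        // placeholder storage folder
    PageFetcher pageFetcher = new PageFetcher(config);

    // This two-argument constructor only exists after the edits above;
    // the stock library still requires a RobotstxtServer here.
    CrawlController controller = new CrawlController(config, pageFetcher);
    controller.addSeed("http://example.com/");         // placeholder seed
    controller.start(MyCrawler.class, 1);              // MyCrawler: your WebCrawler subclass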

Hope that it's what you're looking for.

Hisushi
    Thanks for your answer. You are talking about modifying the code of the crawler4j library. I would prefer not to modify the library (so that I can keep using updates). Theoretically, we should be able to achieve this behaviour without changing the code. – King Midas Aug 27 '14 at 13:23