I need to crawl a site periodically to check whether its URLs are available. For this, I am using crawler4j.
My problem comes with some web pages that disable robots with <meta name="robots" content="noindex,nofollow" />, which makes sense: those pages should not be indexed by search engines because of their content.
However, crawler4j is also not following the links on these pages, even though I have disabled the RobotstxtServer in the configuration. This should be as easy as calling robotstxtConfig.setEnabled(false):
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...
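For completeness, the surrounding wiring follows the standard crawler4j pattern; stripped down it looks roughly like the sketch below (AvailabilityCheck, AvailabilityCrawler, the user agent, the storage folder and the seed URL are placeholders rather than my real values, and I use the stock CrawlController here):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class AvailabilityCheck {

    // Placeholder value; my real code uses a different user agent.
    private static final String USER_AGENT_NAME = "availability-checker";

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder storage folder
        config.setUserAgentString(USER_AGENT_NAME);

        PageFetcher pageFetcher = new PageFetcher(config);

        // Robots handling disabled, exactly as in the snippet above.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/"); // placeholder seed URL
        controller.start(AvailabilityCrawler.class, 1); // crawler class sketched at the end of the question
    }
}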
But the web pages described above are still not explored. I have read the crawler4j code, and this should be enough to disable the robots directives, yet it is not working as expected. Am I missing something? I have tested versions 3.5 and 3.6-SNAPSHOT, with identical results.
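For reference, the crawler class itself does nothing special; a reduced sketch of it (again with placeholder names, using the 3.5-style shouldVisit signature) would just log each URL that is actually visited, which makes it easy to see that the noindex/nofollow pages are never reached:

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class AvailabilityCrawler extends WebCrawler {

    // Placeholder domain; the real code restricts the crawl to the site being checked.
    private static final String DOMAIN = "http://www.example.com/";

    @Override
    public boolean shouldVisit(WebURL url) {
        // No filtering beyond staying on the target site.
        return url.getURL().startsWith(DOMAIN);
    }

    @Override
    public void visit(Page page) {
        // Every page that reaches this method was fetched and parsed.
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}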