Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
11 votes · 1 answer

Crawler4j vs. Jsoup for crawling and parsing pages in Java

I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup. Both of them are capable of retrieving the content of a page and extracting sub-parts of it…
Mike
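
The practical split, if it helps: Jsoup fetches and parses one page at a time, while crawler4j manages the multi-page crawl (frontier, politeness, robots.txt). A minimal Jsoup sketch of the fetch-and-extract half; the URL and CSS selector are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExtract {
    public static void main(String[] args) throws Exception {
        // Jsoup performs the HTTP request and parses the response into a DOM.
        Document doc = Jsoup.connect("https://example.com/").get();

        // CSS selectors pull out the specific parts (selector is illustrative).
        for (Element e : doc.select("h2.title")) {
            System.out.println(e.text());
        }
    }
}
```

A common pattern is to combine the two: let crawler4j drive the crawl and hand each visited page's HTML to Jsoup.parse(...) for extraction.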
9 votes · 3 answers

Web crawling (Ajax/JavaScript-enabled pages) using Java

I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem here is that I was unable to crawl the content of the following site…
Amar
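
crawler4j fetches raw HTML and does not execute JavaScript, so Ajax-rendered content never reaches visit(). A common workaround is to render such pages with a headless browser like HtmlUnit; a hedged sketch (package names as in HtmlUnit 2.x, URL is a placeholder):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);

            // Load the page, then give background Ajax calls time to finish.
            HtmlPage page = client.getPage("https://example.com/ajax-page");
            client.waitForBackgroundJavaScript(5_000);

            // The DOM now contains the JavaScript-generated content.
            System.out.println(page.asXml());
        }
    }
}
```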
8 votes · 1 answer

Crawler in Groovy (JSoup vs. Crawler4j)

I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that has the ability to crawl a website, creating a list of site URLs along with their resource types, their content, the response times, and the number of redirects…
clever_bassi
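
Whichever parser wins, the per-page bookkeeping described here maps naturally onto crawler4j's visit() callback; a Java sketch (a Groovy version would look nearly identical). Response times and redirect counts aren't exposed directly, so those would need custom timing code:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class AuditCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        // Record the URL, its content type, and the payload size of each fetched resource.
        String url = page.getWebURL().getURL();
        String type = page.getContentType();   // e.g. "text/html; charset=UTF-8"
        int bytes = page.getContentData().length;
        System.out.println(url + " | " + type + " | " + bytes + " bytes");
    }
}
```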
7 votes · 1 answer

XPath following-sibling for crawling not returning sibling

I am trying to create a crawler to extract some attribute data from supplier websites that I can audit against our internal attribute database, and I am new to import.io. I watched a bunch of videos, but although my syntax seems to be right, my manual…
Elizabeth VO
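
One detail worth checking with following-sibling: the axis only selects siblings that come after the anchor node, so the anchor must precede the target in the markup. A standalone Java sketch for testing such an expression offline; the file name and the "Color" label are made up, and real HTML may need tidying into well-formed XML first:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class SiblingXPath {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of the supplier page, saved as well-formed XML.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("product.xml");

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Find the cell holding the label, then take the cell right after it.
        String value = xpath.evaluate(
                "//td[normalize-space()='Color']/following-sibling::td[1]", doc);
        System.out.println(value);
    }
}
```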
7 votes · 1 answer

Debug into Maven Dependency Source with IntelliJ

I'm debugging a Maven project in IntelliJ and I'm trying to figure out how to step into the source of one of my dependencies that's specified in my pom.xml. Specifically, my project has a dependency on Crawler4J, and I'm seeing some weird behaviour from…
Tyson
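
For what it's worth, IntelliJ can only step into a dependency once its -sources jar is available; running `mvn dependency:sources` (a goal of the maven-dependency-plugin) downloads them so the IDE can attach them. The dependency itself is declared as usual; the version below is illustrative:

```xml
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version> <!-- illustrative; use the release your project needs -->
</dependency>
```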
6 votes · 2 answers

How can I get crawler4j to download all links from a page more quickly?

What I do is: crawl the page, fetch all of its links and put them in a list, then start a new crawler which visits each link in the list and downloads them. There must be a quicker way, where I can download the links directly when I visit the page…
seinecle
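
The second crawler pass is avoidable: crawler4j already exposes every outgoing link of a visited page through its parse data. A sketch; the download helper is hypothetical and left abstract:

```java
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class DirectDownloadCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // All links found on this page, available without a second crawl.
            Set<WebURL> links = html.getOutgoingUrls();
            for (WebURL link : links) {
                download(link.getURL()); // hypothetical helper: fetch and save the target
            }
        }
    }

    private void download(String url) {
        // e.g. open an HttpURLConnection and stream the bytes to disk
    }
}
```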
6 votes · 2 answers

Syntax error, insert "... VariableDeclaratorId" to complete FormalParameterList

I am facing some issues with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.crawler4j.crawler.CrawlController; import edu.uci.ics.crawler4j.fetcher.PageFetcher; import…
Dinesh Purty
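
That compiler message almost always means executable statements sit directly in the class body rather than inside a method. Wrapped in main, the usual crawler4j bootstrap compiles; the storage folder and seed are placeholders, and MyCrawler stands for your WebCrawler subclass:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.com/"); // placeholder seed
        controller.start(MyCrawler.class, 1);       // MyCrawler extends WebCrawler
    }
}
```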
6 votes · 2 answers

Parsing robots.txt using Java and identifying whether a URL is allowed

I am currently using jsoup in an application to parse and analyse web pages. But I want to make sure that I adhere to the robots.txt rules and only visit pages which are allowed. I am pretty sure that jsoup is not made for this and it's all about…
Emily Webb
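
crawler4j packages its robots.txt handling as standalone classes, so it can sit alongside jsoup. A hedged sketch; I believe allows() is the public check in the 4.x API, but verify against your version:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        RobotstxtConfig robotsConfig = new RobotstxtConfig();
        robotsConfig.setUserAgentName("mybot"); // the agent the rules are matched against

        PageFetcher fetcher = new PageFetcher(new CrawlConfig());
        RobotstxtServer robots = new RobotstxtServer(robotsConfig, fetcher);

        WebURL url = new WebURL();
        url.setURL("https://example.com/private/page.html"); // placeholder URL

        // true only if robots.txt permits this URL for the configured user agent
        System.out.println(robots.allows(url));
    }
}
```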
6 votes · 1 answer

Why does the crawler4j example give an error?

I'm trying to use the Basic crawler example in crawler4j. I took the code from the crawler4j website here. package edu.crawler; import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.crawler4j.crawler.WebCrawler; import…
j.jerrod.taylor
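
A frequent cause with that example: crawler4j 4.x changed shouldVisit to take the referring page as its first parameter, so pre-4.x sample code no longer matches the library. A 4.x-style crawler, with a placeholder domain filter:

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|zip))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // 4.x signature; older examples overrode shouldVisit(WebURL url) instead.
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://example.com/"); // placeholder domain
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}
```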
5 votes · 1 answer

Guide to setting up crawler4j

I would like to set up the crawler to crawl a website, let's say a blog, fetch only the links in the website, and paste the links into a text file. Can you guide me step by step through setting up the crawler? I am using Eclipse.
Wai Loon II
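
Combined with the bootstrap shown a few questions up, the link-dumping part can live in visit(); a sketch that appends each discovered link to links.txt (the file name is a placeholder, and the write is synchronized because crawlers run on multiple threads):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class LinkDumpCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            synchronized (LinkDumpCrawler.class) {
                // Append every outgoing link of this page to the text file.
                try (PrintWriter out = new PrintWriter(new FileWriter("links.txt", true))) {
                    for (WebURL link : html.getOutgoingUrls()) {
                        out.println(link.getURL());
                    }
                } catch (IOException e) {
                    logger.error("Could not write links", e);
                }
            }
        }
    }
}
```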
4 votes · 1 answer

Improving performance of crawler4j

I need to write a web scraper that scrapes around 1M websites and saves their title, description, and keywords into one big file (containing the scraped URL and the related words). The URLs should be extracted from a big file. I've run Crawler4j on the…
Gideon
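
Most of crawler4j's throughput levers are on CrawlConfig, plus the thread count passed to controller.start(). A sketch of the knobs usually tuned first; the values are illustrative, and the per-host politeness delay should stay respectful:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class TunedConfig {
    public static CrawlConfig make() {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path
        config.setPolitenessDelay(200);              // ms between requests to one host
        config.setMaxDepthOfCrawling(1);             // titles/descriptions live on front pages
        config.setMaxPagesToFetch(-1);               // no global page cap
        config.setIncludeBinaryContentInCrawling(false);
        config.setResumableCrawling(false);          // resumability costs extra disk writes
        return config;
    }
    // Then pass a large thread count, e.g. controller.start(MyCrawler.class, 50).
}
```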
4 votes · 2 answers

Disable RobotServer in crawler4j

I need to crawl a site periodically to check whether its URLs are available. For this, I am using crawler4j. My problem comes with some web pages that have disabled robots with
King Midas
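
The robots.txt check can be switched off on RobotstxtConfig before the server is built, though doing so is only appropriate for sites you own or have permission to probe. A sketch:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class NoRobotsBootstrap {
    public static RobotstxtServer make(CrawlConfig config, PageFetcher fetcher) {
        RobotstxtConfig robotsConfig = new RobotstxtConfig();
        robotsConfig.setEnabled(false); // crawler4j will no longer consult robots.txt
        return new RobotstxtServer(robotsConfig, fetcher);
    }
}
```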
4 votes · 1 answer

Crawling PDFs with Crawler4j

I am currently using crawler4j to crawl a website and return the page URLs along with each page's parent page URL. I am using the basic crawler, which is working fine except that it is not returning the PDFs. I know it is crawling the PDFs because I have checked…
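
By default crawler4j skips binary content, so PDFs never reach visit(). Enabling binary crawling and making sure shouldVisit does not filter out .pdf URLs is the usual fix; a sketch with the domain restriction elided:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PdfAwareCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Make sure any extension filter here does NOT exclude .pdf;
        // keep your own domain restriction in place of "return true".
        return true;
    }

    @Override
    public void visit(Page page) {
        String contentType = page.getContentType();
        if (contentType != null && contentType.startsWith("application/pdf")) {
            byte[] pdfBytes = page.getContentData(); // raw PDF, ready to save
            System.out.println(page.getWebURL().getURL()
                    + " (" + pdfBytes.length + " bytes)");
        }
    }
}
// At bootstrap, also enable: config.setIncludeBinaryContentInCrawling(true);
```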
4 votes · 1 answer

crawler4j always returns fatal transport error

This is what I get for any seed I add to crawler4j: ERROR [Crawler 1] Fatal transport error: Connection to http://example.com refused while fetching http://example.com/page.html (link found in doc #0). This is really weird for me. I don't know what…
Ali Hashemi
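
"Connection refused" is raised before any HTTP exchange happens, so it usually points at the environment (firewall, DNS, or a mandatory outbound proxy) rather than at crawler4j itself. If a proxy is required, CrawlConfig can route through it; host and port below are placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ProxyConfig {
    public static CrawlConfig make() {
        CrawlConfig config = new CrawlConfig();
        config.setProxyHost("proxy.internal.example"); // placeholder proxy host
        config.setProxyPort(8080);                     // placeholder proxy port
        return config;
    }
}
```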
4 votes · 3 answers

How to crawl my site to detect 404/500 errors?

Is there any fast (maybe multi-threaded) way to crawl my site (clicking on all local links) to look for 404/500 errors (i.e. ensure 200 response)? I also want to be able to set it to only click into 1 of each type of link. So if I have 1000…
Ryan
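
crawler4j fits this audit: it is multi-threaded and reports every response status through a dedicated hook. A sketch that logs non-200 responses; pair it with a shouldVisit that stays on your own domain:

```java
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusAuditCrawler extends WebCrawler {
    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode,
                                        String statusDescription) {
        // Called for every fetched URL, including 4xx/5xx responses.
        if (statusCode != 200) {
            System.out.println(statusCode + " " + webUrl.getURL()
                    + " (" + statusDescription + ")");
        }
    }
}
```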