Questions tagged [crawler4j]

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Reference: https://github.com/yasserg/crawler4j

174 questions
11 votes · 1 answer

Crawler4j vs. Jsoup for crawling and parsing pages in Java

I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup. Both of them are capable of retrieving the content of a page and extracting sub-parts of it…
Mike
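
The practical split, if it helps: Jsoup fetches and parses one page at a time, while crawler4j manages the multi-page crawl (frontier, politeness, robots.txt). A minimal Jsoup sketch of the fetch-and-extract half; the URL and CSS selector are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExtract {
    public static void main(String[] args) throws Exception {
        // Jsoup performs the HTTP request and parses the response into a DOM.
        Document doc = Jsoup.connect("https://example.com/").get();

        // CSS selectors pull out the specific parts (selector is illustrative).
        for (Element e : doc.select("h2.title")) {
            System.out.println(e.text());
        }
    }
}
```

A common pattern is to combine the two: let crawler4j drive the crawl and hand each visited page's HTML to Jsoup.parse(...) for extraction.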
9 votes · 3 answers

Web crawling (Ajax/JavaScript-enabled pages) using Java

I am very new to web crawling. I am using crawler4j to crawl websites and collect the required information from them. My problem here is that I was unable to crawl the content of the following site…
Amar
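
crawler4j fetches raw HTML and does not execute JavaScript, so Ajax-rendered content never reaches visit(). A common workaround is to render such pages with a headless browser like HtmlUnit; a hedged sketch (package names as in HtmlUnit 2.x, URL is a placeholder):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);

            // Load the page, then give background Ajax calls time to finish.
            HtmlPage page = client.getPage("https://example.com/ajax-page");
            client.waitForBackgroundJavaScript(5_000);

            // The DOM now contains the JavaScript-generated content.
            System.out.println(page.asXml());
        }
    }
}
```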
8 votes · 1 answer

Crawler in Groovy (JSoup vs. Crawler4j)

I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that has the ability to crawl a website, creating a list of site URLs along with their resource types, their content, the response times, and the number of redirects…
clever_bassi
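
Whichever parser wins, the per-page bookkeeping described here maps naturally onto crawler4j's visit() callback; a Java sketch (a Groovy version would look nearly identical). Response times and redirect counts aren't exposed directly, so those would need custom timing code:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class AuditCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        // Record the URL, its content type, and the payload size of each fetched resource.
        String url = page.getWebURL().getURL();
        String type = page.getContentType();   // e.g. "text/html; charset=UTF-8"
        int bytes = page.getContentData().length;
        System.out.println(url + " | " + type + " | " + bytes + " bytes");
    }
}
```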
7 votes · 1 answer

XPath following-sibling for crawling not returning sibling

I am trying to create a crawler to extract some attribute data from supplier websites that I can audit against our internal attribute database, and I am new to import.io. I watched a bunch of videos, but although my syntax seems to be right, my manual…
Elizabeth VO
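
One detail worth checking with following-sibling: the axis only selects siblings that come after the anchor node, so the anchor must precede the target in the markup. A standalone Java sketch for testing such an expression offline; the file name and the "Color" label are made up, and real HTML may need tidying into well-formed XML first:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class SiblingXPath {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of the supplier page, saved as well-formed XML.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("product.xml");

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Find the cell holding the label, then take the cell right after it.
        String value = xpath.evaluate(
                "//td[normalize-space()='Color']/following-sibling::td[1]", doc);
        System.out.println(value);
    }
}
```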
7 votes · 1 answer

Debug into Maven Dependency Source with IntelliJ

I'm debugging a Maven project in IntelliJ and I'm trying to figure out how to step into the source of one of my dependencies that's specified in my pom.xml. Specifically, my project has a dependency on Crawler4J, and I'm seeing some weird behaviour from…
Tyson
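
For what it's worth, IntelliJ can only step into a dependency once its -sources jar is available; running `mvn dependency:sources` (a goal of the maven-dependency-plugin) downloads them so the IDE can attach them. The dependency itself is declared as usual; the version below is illustrative:

```xml
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version> <!-- illustrative; use the release your project needs -->
</dependency>
```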
6 votes · 2 answers

How can I get crawler4j to download all links from a page more quickly?

What I do is: crawl the page, fetch all of its links and put them in a list, then start a new crawler which visits each link in the list and downloads them. There must be a quicker way, where I can download the links directly when I visit the page…
seinecle
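
The second crawler pass is avoidable: crawler4j already exposes every outgoing link of a visited page through its parse data. A sketch; the download helper is hypothetical and left abstract:

```java
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class DirectDownloadCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // All links found on this page, available without a second crawl.
            Set<WebURL> links = html.getOutgoingUrls();
            for (WebURL link : links) {
                download(link.getURL()); // hypothetical helper: fetch and save the target
            }
        }
    }

    private void download(String url) {
        // e.g. open an HttpURLConnection and stream the bytes to disk
    }
}
```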
6 votes · 2 answers

Syntax error, insert "... VariableDeclaratorId" to complete FormalParameterList

I am facing some issues with this code: import edu.uci.ics.crawler4j.crawler.CrawlConfig; import edu.uci.ics.crawler4j.crawler.CrawlController; import edu.uci.ics.crawler4j.fetcher.PageFetcher; import…
Dinesh Purty
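
That compiler message almost always means executable statements sit directly in the class body rather than inside a method. Wrapped in main, the usual crawler4j bootstrap compiles; the storage folder and seed are placeholders, and MyCrawler stands for your WebCrawler subclass:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.com/"); // placeholder seed
        controller.start(MyCrawler.class, 1);       // MyCrawler extends WebCrawler
    }
}
```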
6 votes · 2 answers

Parsing robots.txt using Java and identifying whether a URL is allowed

I am currently using jsoup in an application to parse and analyse web pages. But I want to make sure that I adhere to the robots.txt rules and only visit pages which are allowed. I am pretty sure that jsoup is not made for this and it's all about…
Emily Webb
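
crawler4j packages its robots.txt handling as standalone classes, so it can sit alongside jsoup. A hedged sketch; I believe allows() is the public check in the 4.x API, but verify against your version:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        RobotstxtConfig robotsConfig = new RobotstxtConfig();
        robotsConfig.setUserAgentName("mybot"); // the agent the rules are matched against

        PageFetcher fetcher = new PageFetcher(new CrawlConfig());
        RobotstxtServer robots = new RobotstxtServer(robotsConfig, fetcher);

        WebURL url = new WebURL();
        url.setURL("https://example.com/private/page.html"); // placeholder URL

        // true only if robots.txt permits this URL for the configured user agent
        System.out.println(robots.allows(url));
    }
}
```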
6 votes · 1 answer

Why does the crawler4j example give an error?

I'm trying to use the Basic crawler example in crawler4j. I took the code from the crawler4j website here. package edu.crawler; import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.crawler4j.crawler.WebCrawler; import…
j.jerrod.taylor
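
A frequent cause with that example: crawler4j 4.x changed shouldVisit to take the referring page as its first parameter, so pre-4.x sample code no longer matches the library. A 4.x-style crawler, with a placeholder domain filter:

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|zip))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // 4.x signature; older examples overrode shouldVisit(WebURL url) instead.
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://example.com/"); // placeholder domain
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}
```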
5 votes · 1 answer

Guide to setting up crawler4j

I would like to set up the crawler to crawl a website, let's say a blog, fetch only the links in the website, and paste the links into a text file. Can you guide me step by step through setting up the crawler? I am using Eclipse.
Wai Loon II
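
Combined with the bootstrap shown a few questions up, the link-dumping part can live in visit(); a sketch that appends each discovered link to links.txt (the file name is a placeholder, and the write is synchronized because crawlers run on multiple threads):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class LinkDumpCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            synchronized (LinkDumpCrawler.class) {
                // Append every outgoing link of this page to the text file.
                try (PrintWriter out = new PrintWriter(new FileWriter("links.txt", true))) {
                    for (WebURL link : html.getOutgoingUrls()) {
                        out.println(link.getURL());
                    }
                } catch (IOException e) {
                    logger.error("Could not write links", e);
                }
            }
        }
    }
}
```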
4 votes · 1 answer

Improving performance of crawler4j

I need to write a web scraper that scrapes around 1M websites and saves their title, description, and keywords into one big file (containing the scraped URL and the related words). The URLs should be extracted from a big file. I've run Crawler4j on the…
Gideon
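
Most of crawler4j's throughput levers are on CrawlConfig, plus the thread count passed to controller.start(). A sketch of the knobs usually tuned first; the values are illustrative, and the per-host politeness delay should stay respectful:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class TunedConfig {
    public static CrawlConfig make() {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path
        config.setPolitenessDelay(200);              // ms between requests to one host
        config.setMaxDepthOfCrawling(1);             // titles/descriptions live on front pages
        config.setMaxPagesToFetch(-1);               // no global page cap
        config.setIncludeBinaryContentInCrawling(false);
        config.setResumableCrawling(false);          // resumability costs extra disk writes
        return config;
    }
    // Then pass a large thread count, e.g. controller.start(MyCrawler.class, 50).
}
```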
4 votes · 2 answers

Disable RobotServer in crawler4j

I need to crawl a site periodically to check whether its URLs are available. For this, I am using crawler4j. My problem comes with some web pages that have disabled robots with
King Midas
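
The robots.txt check can be switched off on RobotstxtConfig before the server is built, though doing so is only appropriate for sites you own or have permission to probe. A sketch:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class NoRobotsBootstrap {
    public static RobotstxtServer make(CrawlConfig config, PageFetcher fetcher) {
        RobotstxtConfig robotsConfig = new RobotstxtConfig();
        robotsConfig.setEnabled(false); // crawler4j will no longer consult robots.txt
        return new RobotstxtServer(robotsConfig, fetcher);
    }
}
```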
4 votes · 1 answer

Crawling PDFs with Crawler4j

I am currently using crawler4j to crawl a website and return the page URLs along with each page's parent page URL. I am using the basic crawler, which is working fine except that it is not returning the PDFs. I know it is crawling the PDFs because I have checked…
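
By default crawler4j skips binary content, so PDFs never reach visit(). Enabling binary crawling and making sure shouldVisit does not filter out .pdf URLs is the usual fix; a sketch with the domain restriction elided:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PdfAwareCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Make sure any extension filter here does NOT exclude .pdf;
        // keep your own domain restriction in place of "return true".
        return true;
    }

    @Override
    public void visit(Page page) {
        String contentType = page.getContentType();
        if (contentType != null && contentType.startsWith("application/pdf")) {
            byte[] pdfBytes = page.getContentData(); // raw PDF, ready to save
            System.out.println(page.getWebURL().getURL()
                    + " (" + pdfBytes.length + " bytes)");
        }
    }
}
// At bootstrap, also enable: config.setIncludeBinaryContentInCrawling(true);
```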
4 votes · 1 answer

crawler4j always returns fatal transport error

This is what I get for any seed I add to crawler4j: ERROR [Crawler 1] Fatal transport error: Connection to http://example.com refused while fetching http://example.com/page.html (link found in doc #0). This is really weird for me. I don't know what…
Ali Hashemi
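
"Connection refused" is raised before any HTTP exchange happens, so it usually points at the environment (firewall, DNS, or a mandatory outbound proxy) rather than at crawler4j itself. If a proxy is required, CrawlConfig can route through it; host and port below are placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ProxyConfig {
    public static CrawlConfig make() {
        CrawlConfig config = new CrawlConfig();
        config.setProxyHost("proxy.internal.example"); // placeholder proxy host
        config.setProxyPort(8080);                     // placeholder proxy port
        return config;
    }
}
```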
4 votes · 3 answers

How to crawl my site to detect 404/500 errors?

Is there any fast (maybe multi-threaded) way to crawl my site (clicking on all local links) to look for 404/500 errors (i.e. ensure 200 response)? I also want to be able to set it to only click into 1 of each type of link. So if I have 1000…
Ryan
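
crawler4j fits this audit: it is multi-threaded and reports every response status through a dedicated hook. A sketch that logs non-200 responses; pair it with a shouldVisit that stays on your own domain:

```java
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StatusAuditCrawler extends WebCrawler {
    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode,
                                        String statusDescription) {
        // Called for every fetched URL, including 4xx/5xx responses.
        if (statusCode != 200) {
            System.out.println(statusCode + " " + webUrl.getURL()
                    + " (" + statusDescription + ")");
        }
    }
}
```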