Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?
- Those terms do not have precise definitions. Do you have usage examples? – Greg Hewgill Jul 08 '10 at 19:57
- I want to write an application that walks over a web site based on some XPath-based rules (following specific hyperlinks) and then extracts data from some leaf pages. So it includes both crawling and scraping. I need to find the best possible tools for both steps. – Nayn Jul 08 '10 at 20:09
- Lots of platforms are perfectly good at downloading web pages and applying RegExp to extract links or scraped values. Use what you know. – Steven Sudit Jul 08 '10 at 20:17
- See also: http://stackoverflow.com/questions/4327392/crawling-vs-web-scraping – David J. Jul 12 '12 at 13:11
7 Answers
A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s).
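As a rough illustration of that description, here is a minimal breadth-first crawler sketch in Python. The names `crawl`, `LinkCollector`, `max_depth` and `skip_suffixes` are made up for this example, and `fetch` is an assumed caller-supplied function that returns a page's HTML (or `None`), so the sketch itself does no network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls, fetch, max_depth=2, skip_suffixes=(".jpg", ".png", ".pdf")):
    """Breadth-first crawl; returns {url: html} for every page downloaded.

    `fetch` is a hypothetical caller-supplied function mapping a URL to its
    HTML text, or None if the page cannot be retrieved.
    """
    pages = {}
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, ignore its links
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment part.
            target, _fragment = urldefrag(urljoin(url, href))
            if target in seen or target.lower().endswith(skip_suffixes):
                continue  # already queued, or a file type we chose to ignore
            seen.add(target)
            queue.append((target, depth + 1))
    return pages
```

In real use, `fetch` could wrap something like `urllib.request.urlopen`; passing it in keeps the traversal logic separate from the downloading.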
A scraper takes pages that have been downloaded or, in a more general sense, data that's formatted for display, and (attempts to) extract data from those pages, so that it can (for example) be stored in a database and manipulated as desired.
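A scraper along those lines might look like the following sketch. It assumes, purely for illustration, that the wanted values sit in elements whose `class` attribute is `title` or `price`; the `FieldScraper`, `scrape` and `store` names and the `items` table are likewise hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class FieldScraper(HTMLParser):
    """Extracts the text of elements whose class attribute names a wanted field.

    Simplification: any closing tag ends the current field, so nested markup
    inside a field is not handled.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # field currently being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.wanted:
            self.current = css_class
            self.fields[css_class] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data

    def handle_endtag(self, tag):
        self.current = None


def scrape(html, wanted=("title", "price")):
    """Return {field: text} for one already-downloaded page."""
    parser = FieldScraper(wanted)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}


def store(rows, connection):
    """Save scraped rows (dicts with 'title' and 'price') in a database table."""
    connection.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, price TEXT)")
    connection.executemany(
        "INSERT INTO items (title, price) VALUES (:title, :price)", rows
    )
    connection.commit()
```

Note that the scraper never fetches anything itself: it only needs the page text, which is the separation the answer describes.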
Depending on how you use the result, scraping may well violate the rights of the owner of the information and/or user agreements about use of web sites (crawling violates the latter in some cases as well). Many sites include a file named robots.txt in their root (i.e. having the URL http://server/robots.txt) to specify how (and if) crawlers should treat that site -- in particular, it can list (partial) URLs that a crawler should not attempt to visit. These can be specified separately per crawler (user-agent) if desired.
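For instance, Python's standard `urllib.robotparser` module can check a URL against such rules. The robots.txt content and the `examplebot` user-agent below are invented for the example; a real crawler would download the file from the site instead of using a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one specific crawler is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def build_rules(robots_text):
    """Parse robots.txt text into a rule set that can be queried per user-agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
allowed_for_other = rules.can_fetch("otherbot", "http://server/index.html")
blocked_private = rules.can_fetch("otherbot", "http://server/private/page.html")
blocked_for_examplebot = rules.can_fetch("examplebot", "http://server/index.html")
```

A polite crawler would call `can_fetch` with its own user-agent before requesting each URL and skip anything disallowed.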

Crawlers surf the web, following links. An example would be the Google robot that gets pages to index. Scrapers extract values from forms, but don't necessarily have anything to do with the web.

- Scrapers extract values from screens, not necessarily HTML. For example, I once used a scraper to extract values from old mainframe forms. – Steven Sudit Jul 08 '10 at 20:02
- I can't give Google a free pass on this. Google is a crawler, yes, but ALSO a scraper. How else do they have the meta description to display in the search results? The title? The dates of posts? They're the ultimate crawler AND scraper. – Henley Nov 30 '12 at 23:18
A web crawler gets links (URLs, i.e. pages) according to some logic, and a scraper gets (extracts) values from HTML.
There are many web crawler tools (the linked page lists some). Any XML or HTML parser can be used to extract (scrape) data from crawled pages. (I recommend Jsoup for parsing and extracting data.)

Generally, crawlers follow links to reach numerous pages, while scrapers, in some sense, just pull the content displayed online and do not follow deeper links.
The most typical crawler is Googlebot, which follows links to reach all the pages on your website and indexes their content if it finds it useful (that's why you need robots.txt to tell it which content you do not want indexed). That is what lets us search for such content on Google. The purpose of a scraper, by contrast, is just to pull content for personal use, which does not have much effect on others.
However, there is no sharp difference between crawlers and scrapers now, as some automated web scraping tools, such as Octoparse and import.io, also let you crawl a website by following links. They are not crawlers like Googlebot, but they can automatically crawl websites to collect large amounts of data without coding.

Scrapers and crawlers are not always distinct: you can find crawlers that also scrape. In fact, Scraper Crawler does both and is named accordingly:
- it crawls a URL, i.e. indexes all the URLs under that main URL
- the depth of crawling is how far the indexing goes in the URL tree
- then it scrapes whatever you define in a regexp
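The three steps above can be sketched generically as follows. This is not the actual Scraper Crawler tool, whose internals are not shown here: `crawl_and_scrape`, `HREF_RE` and the injected `fetch` function are assumptions made for the example, and pulling links out of HTML with a regexp is a deliberate simplification.

```python
import re
from urllib.parse import urljoin

# Simplistic link pattern: only matches double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl_and_scrape(url, fetch, pattern, depth, seen=None):
    """Visit `url` and the pages it links to, following at most `depth`
    levels of links, and return {url: [regexp matches]} per fetched page.

    `fetch` is a hypothetical function mapping a URL to HTML text or None.
    """
    seen = set() if seen is None else seen
    if url in seen:
        return {}
    seen.add(url)
    html = fetch(url)
    if html is None:
        return {}
    # Scrape step: apply the user-defined regexp to this page.
    results = {url: re.findall(pattern, html)}
    # Crawl step: descend into linked pages while depth remains.
    if depth > 0:
        for href in HREF_RE.findall(html):
            child = urljoin(url, href)
            results.update(crawl_and_scrape(child, fetch, pattern, depth - 1, seen))
    return results
```

With `depth=0` only the starting page is scraped; each extra level lets the crawl step go one link further down the URL tree.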

I know this question is quite old, but I'll respond anyway for newcomers who wander here.
From what I can gather, the two terms are often confused with each other because of their similarity, and people often use them to mean the same thing.
However, they are not quite the same. A crawler (or spider) follows each link in the pages it crawls, beginning from the starting page. This is why it is also referred to as a spider bot, since it creates a kind of spider web of pages.
A scraper extracts the data from a page, usually from pages downloaded by the crawler.
If you are interested in either of those, you can try the Norconex HTTP Collector.

A crawler is a program that systematically navigates web pages, following links to gather information. A scraper is a tool that extracts specific data from websites. Crawlers explore, while scrapers extract.
For example, on an online store, a crawler would start at the homepage of the website, follow links to different product pages, and collect the URLs of those pages. It would continue this process, exploring various pages and collecting data along the way.
On the other hand, a scraper would focus on a specific product page. It would extract the desired information, such as the product name, price, and description, from that particular page. The scraper would repeat this process for each product page of interest.
In summary, the crawler would navigate through the website, while the scraper would extract specific data from individual pages.
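To make the division of labour concrete, here is a small sketch of the two roles. Everything specific in it is assumed for illustration: the `/product/` URL pattern, the `name`, `price` and `description` element ids, and the `fetch` function that returns a page's HTML. The crawler step only looks one level deep, at links on the homepage.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class Product:
    name: str
    price: str
    description: str


class _PageParser(HTMLParser):
    """Records every link, plus the text of each element that has an id."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_by_id = {}
        self._open_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        self._open_id = attrs.get("id")
        if self._open_id:
            self.text_by_id[self._open_id] = ""

    def handle_data(self, data):
        if self._open_id:
            self.text_by_id[self._open_id] += data

    def handle_endtag(self, tag):
        self._open_id = None


def _parse(html):
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return parser


def collect_product_urls(home_url, fetch):
    """Crawler role: read the homepage links and keep the product page URLs."""
    links = (urljoin(home_url, href) for href in _parse(fetch(home_url)).links)
    return sorted({link for link in links if "/product/" in link})


def scrape_product(html):
    """Scraper role: extract name, price and description from one page."""
    text = _parse(html).text_by_id
    return Product(
        text["name"].strip(), text["price"].strip(), text["description"].strip()
    )
```

The crawler's output (a list of URLs) is the scraper's input, one page at a time, which is why the two are often combined in a single tool.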
See also: Web Scraping vs Crawling
