Is there a difference between Crawling and Web-scraping?

If there's a difference, what's the best method to use in order to collect some web data to supply a database for later use in a customised search engine?

wassimans
    Scraping means pulling content from a page. Crawling means following links to reach numerous pages. Crawlers have to scrape, and that's for two reasons: one is that useful crawlers don't just traverse pages for nothing; they collect info (e.g. indexing words to build a search index for a search engine). Secondly, they have to discover links to other pages. – Kaz Oct 10 '13 at 23:50

6 Answers

Crawling is essentially what Google, Yahoo, MSN, etc. do: looking for ANY information. Scraping is generally targeted at certain websites for specific data, e.g. for price comparison, so the two are coded quite differently.

Usually a scraper is bespoke to the websites it is supposed to scrape, and may do things a (good) crawler wouldn't do, e.g.:

  • Have no regard for robots.txt
  • Identify itself as a browser
  • Submit forms with data
  • Execute JavaScript (if required to act like a user)
the Tin Man
Ben
    @Ben Do you know where I can find out more about how a web scraper identifies itself as a browser? Wikipedia says "implementing low-level Hypertext Transfer Protocol (HTTP)" but I'd like to really know more how it works. – Honinbo Shusaku Jul 13 '15 at 18:28
    @Abdul in HTTP requests, you can set the "User-Agent" header to identify yourself. If, for instance, you set it to "Mozilla/5.0 ... Chrome" or another string that Chrome sends, your scraper will look like a browser to the server. – Amani Kilumanga Mar 16 '16 at 00:17
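As a rough illustration of the comment above, here is a minimal sketch of setting the User-Agent header with Python's standard `urllib`. The URL and the User-Agent string are made-up placeholders, not values from this thread, and the request is only built, not sent.

```python
import urllib.request

# Hypothetical values: neither the URL nor the User-Agent string comes from the thread.
URL = "https://example.com/"
BROWSER_LIKE_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Return a request object whose User-Agent header is set to `user_agent`."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request(URL, BROWSER_LIKE_UA)

# urllib stores header names via str.capitalize(), hence the "User-agent" key.
print(request.get_header("User-agent"))

# To actually send it (needs network access):
# urllib.request.urlopen(request)
```

The server only sees the string you send, which is why a scraper can pass for a browser and why a well-behaved bot is expected to send an honest one.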

Yes, they are different. In practice, you may need to use both.

(I have to jump in because, so far, the other answers don't get to the essence of it. They use examples but don't make the distinctions clear. Granted, they are from 2010!)

Web scraping, to use a minimal definition, is the process of parsing a web document and extracting information from it. You can do web scraping without doing web crawling.
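To make the definition concrete, here is a minimal scraping sketch using only the standard library's `html.parser`. The page markup and the `price` class are invented for illustration; a real scraper would be written against the actual structure of its target page.

```python
from html.parser import HTMLParser

# Hypothetical page: the markup and the "price" class are invented for illustration.
PAGE = """
<html><body>
  <h1>Widget</h1>
  <span class="price">19.99</span>
  <span class="stock">in stock</span>
</body></html>
"""


class PriceScraper(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def scrape_prices(html: str) -> list[str]:
    """Parse one document and return the extracted price strings."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


print(scrape_prices(PAGE))
```

Note that no links are followed here: a single document goes in and the extracted data comes out, which is scraping without crawling.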

Web crawling, to use a minimal definition, is the process of iteratively finding and fetching web links starting from a list of seed URLs. Strictly speaking, to do web crawling you have to do some degree of web scraping (to extract the URLs).
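The crawling loop described above might look roughly like this. It is a sketch under stated assumptions: the "web" is a hypothetical in-memory dictionary so the example runs offline, and `fetch` is an injected function (an assumption of this sketch) that a real crawler would replace with an HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://site.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://site.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://site.test/b": '<a href="/c">C</a>',
    "http://site.test/c": "no links here",
}


class LinkExtractor(HTMLParser):
    """The 'scraping' part of crawling: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    """Return absolute URLs for every link found in `html`."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from `seeds`; `fetch(url)` returns HTML or None."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(["http://site.test/"], FAKE_WEB.get))
```

The `seen` set is what keeps the crawler from looping when pages link back to each other, and `max_pages` is a simple cap so the crawl always terminates.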

To clear up some concepts mentioned in the other answers:

  • robots.txt is intended to apply to any automated process that accesses a web page. So it applies to both crawlers and scrapers.

  • 'Proper' crawlers and scrapers should both identify themselves accurately.
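Both points can be sketched with the standard library's `urllib.robotparser`. The robots.txt content, the bot name, and the URLs below are hypothetical; a real crawler or scraper would download the site's own robots.txt and send its own honest User-Agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and bot identity, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
USER_AGENT = "ExampleBot/0.1 (+https://example.com/bot-info)"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, USER_AGENT, "https://example.com/public/page"))
print(is_allowed(ROBOTS_TXT, USER_AGENT, "https://example.com/private/page"))
```

The same check applies whether the caller is a crawler deciding which links to follow or a scraper about to fetch a single page.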

David J.

AFAIK, web crawling is what Google does: it goes around a website looking at links and building a database of the layout of that site and of the sites it links to.

Web scraping would be the programmatic analysis of a web page to pull some data off it, e.g. loading BBC Weather and ripping (scraping) the weather forecast off it, then placing it elsewhere or using it in another program.

Chris Harden

There's a fundamental difference between these two. For those looking to dig deeper, I suggest you read this - Web scraper, Web Crawler

This post goes into detail. A good summary is in this chart from the article: [image: chart showing the difference between scraping and crawling]

Freiheit
    Note that [link-only answers](http://meta.stackoverflow.com/tags/link-only-answers/info) are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference. – kleopatra Sep 06 '13 at 10:18
  • Hey @Mohit the link is broken... any other source – konzo May 09 '16 at 01:53

There's definitely a difference between these two. Crawling refers to visiting a site's pages; scraping refers to extracting data from them.

Annie

We crawl sites to get a broad perspective on how the site is structured and how its pages are connected, and to estimate how much time we need to visit all the pages we are interested in. Scraping is often harder to implement, but it is the essence of data extraction. Think of scraping as covering a website with a sheet of paper that has some rectangles cut out. We can now see only the things we need, completely ignoring the parts of the website that are common to all pages (like navigation, footer, ads) and extraneous information such as comments or breadcrumbs. You can find more about the differences between crawling and scraping here: https://tarantoola.io/web-scraping-vs-web-crawling/

shirk3y