
I have been thinking about trying to write a simple crawler that would crawl and produce a list of its findings for our NPO's websites and content.

Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc., etc.?

Kara

10 Answers


You'll be reinventing the wheel, to be sure. But here are the basics:

  • A list of unvisited URLs - seed this with one or more starting pages
  • A list of visited URLs - so you don't go around in circles
  • A set of rules for URLs you're not interested in - so you don't index the whole Internet

Put these in persistent storage, so you can stop and start the crawler without losing state.

Algorithm is:

while(list of unvisited URLs is not empty) {
    take URL from list
    remove it from the unvisited list and add it to the visited list
    fetch content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
           if it matches your rules
              and it's not already in either the visited or unvisited list
              add it to the unvisited list
        }
    }
}
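
For illustration, here is a minimal single-threaded sketch of that loop in Java, using the JDK's built-in HttpClient and a naive regex for link extraction. The seed URL and the "stay on this host" rule are placeholder assumptions; a real crawler would use an HTML parser and keep both lists in persistent storage as described above.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleCrawler {
        // Naive href extraction; a real crawler should use an HTML parser.
        private static final Pattern LINK = Pattern.compile("href=[\"'](https?://[^\"'#]+)[\"']");

        public static void main(String[] args) throws Exception {
            String seed = "https://www.example.org/";          // placeholder seed page
            Deque<String> unvisited = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();                // visited + already queued
            unvisited.add(seed);
            seen.add(seed);

            HttpClient client = HttpClient.newHttpClient();
            while (!unvisited.isEmpty()) {
                String url = unvisited.poll();                 // take a URL from the list
                try {
                    HttpResponse<String> resp = client.send(
                            HttpRequest.newBuilder(URI.create(url)).build(),
                            HttpResponse.BodyHandlers.ofString());
                    // "record whatever it is you want to about the content"
                    System.out.println(resp.statusCode() + " " + url);

                    String type = resp.headers().firstValue("Content-Type").orElse("");
                    if (type.contains("text/html")) {
                        Matcher m = LINK.matcher(resp.body());
                        while (m.find()) {
                            String link = m.group(1);
                            // "matches your rules": here, stay on the seed's host (assumption)
                            if (link.startsWith("https://www.example.org/") && seen.add(link)) {
                                unvisited.add(link);
                            }
                        }
                    }
                } catch (Exception e) {
                    System.err.println("Failed to fetch " + url + ": " + e.getMessage());
                }
            }
        }
    }
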
slim
  • Great answer, but when you say re-inventing the wheel, where exactly are the free open source web crawler frameworks? possibly for java but i haven't found any for .net. – Anonymous Type Oct 06 '10 at 00:14
  • Ugh, hit enter too soon. That link has a good few, none of which is .Net. However, I don't really understand why you'd choose to restrict yourself to .Net. – slim Oct 07 '10 at 10:30
  • hi, i came across this answer and i thought you can provide me some insights on developing a web crawler. Assuming i have done the above steps, what happen when i have visited all the URLs? do i break out of the while-loop and end the script? or do you run it as a daemon or simple a while loop to retrieve the unvisited URLs again? – chrizonline Jul 28 '14 at 03:19
  • ahhh, the first thing you might want to do in the `while` loop is add the URL to the `already listed list`... else you might end up in an infinite loop if two pages refer to each other ... – CpILL Feb 18 '18 at 19:20
  • @CpILL You're right - it's taken 9 years for anyone to notice. Fixed now. – slim Feb 19 '18 at 08:27

The complicated part of a crawler comes when you want to scale it to a huge number of websites/requests. In that situation you will have to deal with issues like:

  • It becomes impossible to keep all the info in a single database.

  • There is not enough RAM to deal with the huge index(es).

  • Multithreaded performance and concurrency.

  • Crawler traps (infinite loops created by changing URLs, calendars, session IDs...) and duplicated content.

  • Crawling from more than one computer.

  • Malformed HTML.

  • Constant HTTP errors from servers.

  • Databases without compression, which make your space requirements about 8x bigger.

  • Recrawl routines and priorities.

  • Using requests with compression (deflate/gzip), which is good for any kind of crawler.

And some important things:

  • Respect robots.txt.

  • Add a crawl delay between requests so you don't suffocate the web servers (a rough sketch of both points follows below).
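
A rough sketch of those two points in Java, assuming a fixed one-second delay and very naive robots.txt handling (a real crawler should use a proper robots.txt parser and honor Crawl-delay):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayList;
    import java.util.List;

    public class PoliteFetcher {
        private final HttpClient client = HttpClient.newHttpClient();
        private final List<String> disallowed = new ArrayList<>();
        private final long delayMillis = 1000;   // assumed crawl delay between requests

        public PoliteFetcher(String host) throws Exception {
            // Very naive robots.txt handling: collect every Disallow: prefix.
            HttpResponse<String> robots = client.send(
                    HttpRequest.newBuilder(URI.create("https://" + host + "/robots.txt")).build(),
                    HttpResponse.BodyHandlers.ofString());
            for (String line : robots.body().split("\n")) {
                if (line.toLowerCase().startsWith("disallow:")) {
                    disallowed.add(line.substring("disallow:".length()).trim());
                }
            }
        }

        // Returns the page body, or null if robots.txt disallows the URL.
        public String fetch(String url) throws Exception {
            String path = URI.create(url).getPath();
            for (String prefix : disallowed) {
                if (!prefix.isEmpty() && path != null && path.startsWith(prefix)) {
                    return null;                 // the site asked us not to crawl this
                }
            }
            Thread.sleep(delayMillis);           // don't suffocate the web server
            HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
            return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
        }
    }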

stema
lexmooze
  • Great answer! You can deal with the RAM issues by using a Bloom Filter. – arachnode.net Mar 03 '13 at 21:31
  • I think the answer to the first 1-3 and 5 is Amazon's AWS. Hashs can solve the 'duplicated content'. Scraping library like Beautiful Soup can handle 6. 7- check your http headers. 8 - use a database with compression. etc – CpILL Feb 18 '18 at 19:25

Multithreaded Web Crawler

If you want to crawl a large website then you should write a multi-threaded crawler. Connecting, fetching and writing the crawled information to files or a database are the three steps of crawling, but if you use a single thread then your CPU and network utilization will be poor.

A multi-threaded web crawler needs two data structures: linksVisited (this should be implemented as a hashmap or trie) and linksToBeVisited (this is a queue).

The web crawler uses BFS to traverse the world wide web.

Algorithm of a basic web crawler:

  1. Add one or more seed URLs to linksToBeVisited. The method that adds a URL to linksToBeVisited must be synchronized.
  2. Pop an element from linksToBeVisited and add it to linksVisited. This pop method must also be synchronized.
  3. Fetch the page from the internet.
  4. Parse the file and add every not-yet-visited link found on the page to linksToBeVisited. URLs can be filtered if needed; the user can give a set of rules to decide which URLs should be scanned.
  5. Save the necessary information found on the page to a database or file.
  6. Repeat steps 2 to 5 until linksToBeVisited is empty.

    Here is a code snippet showing how to synchronize the threads (wrapped in a small class with the two shared lists declared, so it compiles on its own):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class LinkFrontier {
        // Shared between all worker threads, hence the synchronization below.
        private final Set<String> linksVisited = new HashSet<>();
        private final List<String> linksToBeVisited = new ArrayList<>();

        public void add(String site) {
            synchronized (this) {
                // Queue a URL only if it is neither visited nor already queued.
                if (!linksVisited.contains(site) && !linksToBeVisited.contains(site)) {
                    linksToBeVisited.add(site);
                }
            }
        }

        public String next() {
            if (linksToBeVisited.size() == 0) {
                return null;
            }
            synchronized (this) {
                // Need to check again, another thread may have emptied the list.
                if (linksToBeVisited.size() > 0) {
                    String s = linksToBeVisited.get(0);
                    linksToBeVisited.remove(0);
                    linksVisited.add(s);
                    return s;
                }
                return null;
            }
        }
    }
    

kleopatra
alienCoder
  • Or you could simply use node.js asynchronously. – Totty.js Jun 13 '13 at 15:54
  • Here we are talking about large scale crawlers, javascript cannot be used for such a crawler. Best practice is c or c++ , java also works good. – alienCoder Apr 03 '14 at 10:37
  • Why are you saying that js is not scalable? Any proof you can show to me, please? – Totty.js Apr 03 '14 at 11:00
  • Come on, javascript is an interpreted, dynamic language which runs completely on web browser, so performance and scalability depends on browsers capability. If you create many threads the browser will freeze. Javascript is good for web applications (and for some toy programs) but not for large scale applications.If you want to write a toy crawler then it is fine, but when it comes to handle real world multithreaded applications (here u'll have to deal with TB's and PB's) then javascript cannot come even close to compiled languages. – alienCoder Apr 03 '14 at 14:23
  • I think that you didn't even heard about node.js: https://www.google.pt/search?q=node.js+linkedin – Totty.js Apr 03 '14 at 14:37
  • I am aware of node.js and its popularity for its scalability, but I think you missed my point -- still it runs on a web browser, so scalability, performance everything depends on the capability of a web browser. :) Can you give me a single large scale real world example where js has been used instead of compiled languages? – alienCoder Apr 04 '14 at 03:13
  • First of all node.js doesn't run in the browser, you don't need the browser to run node.js apps. Large scale real world examples (Linked in mobile, as told in my above answer): https://www.google.pt/search?q=Large+scale+real+world+examples+node.js, https://github.com/joyent/node/wiki/Projects,-Applications,-and-Companies-Using-Node, http://venturebeat.com/2011/08/16/linkedin-node/ – Totty.js Apr 04 '14 at 10:24
  • Now that's new theory :). So, u are telling me node.js does not runs on web browser and linked in is built on javascript. For ur information linkedin backend is java ( not js, there is a diff between java and js), front end used HTML5 and js http://www.slideshare.net/linkedin/linkedins-communication-architecture. – alienCoder Apr 04 '14 at 13:35
  • I don't know how to tell you but javascript can run both on server and on client side lol please wake up, man! Did you read what I've told you (http://venturebeat.com/2011/08/16/linkedin-node/, "On the server side, our entire mobile software stack is completely built in Node,")? What is node.js: http://nodejs.org/ . You must be kidding! – Totty.js Apr 04 '14 at 13:38
  • Good joke :) "Node.js is a platform built on Chrome's JavaScript runtime" -- nodejs.org. Wheather u run it on server side or client side it is runs on Google v8 ( fyi whuch is written in C++ :)). It is the C++ backend which gives it power, node.js uses power of C++ to make a scalable web apps. For web applications it is good, but crawler is not a web application. For a web crawler compiled languages are best choice. C++, Java is much much faster than node.js ( of course J2EE is slower than node.js in some cases, but for a web crawler we don't need J2EE). – alienCoder Apr 05 '14 at 05:05
  • Here are a few reasons we should not use node.js to make a real world web crawler ( of course u can make a toy crawler ) : – alienCoder Apr 05 '14 at 05:20
  • 1. node.js is single threaded. 2. It completely depends on V8. 3. Debugging JavaScript on Node.js is painful. 4.JavaScript is an interpreted language, code is slower than C++ . 5. For a webcrawler we don't need to make it a web application. 6. The Node.js API has a habit of changing in backwards-incompatible ways from release to release. – alienCoder Apr 05 '14 at 05:31
  • 7.JavaScript is a language with a beautiful core, but an absolutely anemic standard library. 8. lack of inherent code organization is a huge disadvantage etc. etc. A real life web crawler is not a simple program, it has to look after many issues. But for simple crawler node.js can be used. – alienCoder Apr 05 '14 at 05:32

Crawlers are simple in concept.

You get a root page via an HTTP GET, parse it to find URLs and put them on a queue unless they've been parsed already (so you need a global record of pages you have already parsed).

You can use the Content-type header to find out what the type of content is, and limit your crawler to only parsing the HTML types.

You can strip out the HTML tags to get the plain text, which you can do text analysis on (to get tags, etc. - the meat of the page). You could even do that on the alt/title attributes of images if you get that advanced.

And in the background you can have a pool of threads eating URLs from the queue and doing the same. You want to limit the number of threads, of course (a rough sketch follows below).
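
As a rough sketch of that thread-pool idea in Java, where fetchAndParse() is a hypothetical placeholder for the GET + Content-Type check + link extraction described above:

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    public class CrawlerPool {
        private static final int THREADS = 4;                            // limit the number of threads
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final Set<String> seen = ConcurrentHashMap.newKeySet();  // global record of parsed pages
        private final ExecutorService pool = Executors.newFixedThreadPool(THREADS);

        public void start(String seed) {
            submit(seed);
            for (int i = 0; i < THREADS; i++) {
                pool.submit(this::workLoop);               // the background workers
            }
        }

        private void submit(String url) {
            if (seen.add(url)) {                           // only queue URLs we haven't seen
                queue.add(url);
            }
        }

        private void workLoop() {
            try {
                while (true) {
                    String url = queue.take();                 // block until a URL is available
                    for (String link : fetchAndParse(url)) {   // hypothetical placeholder
                        submit(link);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();            // workers run until interrupted
            }
        }

        private List<String> fetchAndParse(String url) {
            // Placeholder: HTTP GET, check Content-Type, strip tags, extract hrefs.
            return List.of();
        }
    }
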

JeeBee

If your NPO's sites are relatively big or complex (having dynamic pages that'll effectively create a 'black hole', like a calendar with a 'next day' link) you'd be better off using a real web crawler, like Heritrix.

If the sites total only a few pages you can get away with just using curl or wget or your own script. Just remember that if they start to get big, or you start making your script more complex, to switch to a real crawler, or at least look at its source to see what it is doing and why.

Some issues (there are more):

  • Black holes (as described)
  • Retries (what if you get a 500?)
  • Redirects
  • Flow control (else you can be a burden on the sites)
  • robots.txt implementation
Vinko Vrsalovic
  • Can you please provide some insight into dealing with the issues you mention? In particular, black holes? – Shabbyrobe May 18 '09 at 04:20
  • The usual way out of black holes is programming a configurable limit for each domain or regex matching URL (ie, if URL matches this or domain is that, move on after 1000 retrieved matching pages). Flow control is implemented typically in pages per second per domain (usually they make you wait more than one second as to avoid being a burden). – Vinko Vrsalovic May 18 '09 at 05:00
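
For illustration, the per-domain cap and pages-per-second flow control described in that comment could look roughly like this in Java (the cap and delay values are assumptions):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DomainPolicy {
        private static final int MAX_PAGES_PER_HOST = 1000;   // "move on after 1000 pages"
        private static final long MIN_DELAY_MILLIS = 1000;    // roughly one page per second per host

        private final Map<String, Integer> pagesPerHost = new ConcurrentHashMap<>();
        private final Map<String, Long> lastFetchPerHost = new ConcurrentHashMap<>();

        // Returns true if we are still allowed to fetch from this host.
        public boolean allow(String host) {
            int count = pagesPerHost.merge(host, 1, Integer::sum);
            return count <= MAX_PAGES_PER_HOST;                // black-hole guard: stop after the cap
        }

        // Sleeps just long enough to respect the per-host rate limit.
        public void throttle(String host) throws InterruptedException {
            long now = System.currentTimeMillis();
            long last = lastFetchPerHost.getOrDefault(host, 0L);
            long wait = MIN_DELAY_MILLIS - (now - last);
            if (wait > 0) {
                Thread.sleep(wait);
            }
            lastFetchPerHost.put(host, System.currentTimeMillis());
        }
    }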

Wikipedia has a good article about web crawlers, covering many of the algorithms and considerations.

However, I wouldn't bother writing my own crawler. It's a lot of work, and since you only need a "simple crawler", I'm thinking all you really need is an off-the-shelf crawler. There are a lot of free and open-source crawlers that will likely do everything you need, with very little work on your part.

Derek Park

You could make a list of words and start a thread for each word searched on Google.
Then each thread will create a new thread for each link it finds in the page.
Each thread should write what it finds to a database. When each thread finishes reading the page, it terminates.
And there you have a very big database of links.

Gero

Use wget to do a recursive web suck, which will dump all the files onto your hard drive, then write another script to go through all the downloaded files and analyze them.
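
For example, a recursive invocation along the lines of `wget --recursive --no-parent --wait=1 https://your-npo-site.example/` (the URL is a placeholder) will mirror a site into local files while staying under the starting path and pausing between requests; see the wget manual for the full set of recursion and politeness options.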

Edit: or maybe curl instead of wget, but I am not familiar with curl, so I do not know if it does recursive downloads like wget.

whatsisname

I'm using Open Search Server for my company's internal search; try this: http://open-search-server.com. It's also open source.

Sathishkumar

I did a simple web crawler using Reactive Extensions in .NET.

https://github.com/Misterhex/WebCrawler

public class Crawler
{
    class ReceivingCrawledUri : ObservableBase<Uri>
    {
        public int _numberOfLinksLeft = 0;

        private ReplaySubject<Uri> _subject = new ReplaySubject<Uri>();
        private Uri _rootUri;
        private IEnumerable<IUriFilter> _filters;

        public ReceivingCrawledUri(Uri uri)
            : this(uri, Enumerable.Empty<IUriFilter>().ToArray())
        { }

        public ReceivingCrawledUri(Uri uri, params IUriFilter[] filters)
        {
            _filters = filters;

            CrawlAsync(uri).Start();
        }

        protected override IDisposable SubscribeCore(IObserver<Uri> observer)
        {
            return _subject.Subscribe(observer);
        }

        private async Task CrawlAsync(Uri uri)
        {
            using (HttpClient client = new HttpClient() { Timeout = TimeSpan.FromMinutes(1) })
            {
                IEnumerable<Uri> result = new List<Uri>();

                try
                {
                    string html = await client.GetStringAsync(uri);
                    result = CQ.Create(html)["a"].Select(i => i.Attributes["href"]).SafeSelect(i => new Uri(i));
                    result = Filter(result, _filters.ToArray());

                    result.ToList().ForEach(async i =>
                    {
                        Interlocked.Increment(ref _numberOfLinksLeft);
                        _subject.OnNext(i);
                        await CrawlAsync(i);
                    });
                }
                catch
                { }

                if (Interlocked.Decrement(ref _numberOfLinksLeft) == 0)
                    _subject.OnCompleted();
            }
        }

        private static List<Uri> Filter(IEnumerable<Uri> uris, params IUriFilter[] filters)
        {
            var filtered = uris.ToList();
            foreach (var filter in filters.ToList())
            {
                filtered = filter.Filter(filtered);
            }
            return filtered;
        }
    }

    public IObservable<Uri> Crawl(Uri uri)
    {
        return new ReceivingCrawledUri(uri, new ExcludeRootUriFilter(uri), new ExternalUriFilter(uri), new AlreadyVisitedUriFilter());
    }

    public IObservable<Uri> Crawl(Uri uri, params IUriFilter[] filters)
    {
        return new ReceivingCrawledUri(uri, filters);
    }
}

and you can use it as follows:

Crawler crawler = new Crawler();
IObservable<Uri> observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine, 
onCompleted: () => Console.WriteLine("Crawling completed"));
Misterhex