0

I am building a web crawler that has to crawl hundreds of websites. The crawler keeps a list of URLs it has already crawled. Whenever it is about to crawl a new page, it first searches that list, and if the URL is already there it skips to the next URL, and so on. Once a URL has been crawled, it is added to the list.

Currently I am using binary search to search the URL list, but the problem is that once the list grows large, searching becomes very slow. So my question is: what algorithm can I use to search a list of URLs (the size of the list grows to about 20k to 100k daily)?

The crawler is currently coded in Python, but I am going to port it to C++ or another, faster language.
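For reference, the lookup currently works roughly like this (a simplified sketch, not my actual code):

```python
import bisect

crawled = []  # sorted list of already-crawled URLs

def already_crawled(url):
    # binary search over the sorted list
    i = bisect.bisect_left(crawled, url)
    return i < len(crawled) and crawled[i] == url

def mark_crawled(url):
    # insert while keeping the list sorted (this insertion is itself O(n))
    bisect.insort(crawled, url)
```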

Sean Bright
Dr. Sahib
  • Why is java tagged? Also, maybe read up on the Trie. – nbryans Jun 23 '16 at 17:20
  • Given that your list is already sorted since you are using binary search, I don't think you can have a better solution than binary search. Have you tried timing the compute-intensive parts of the program? My guess is that the bottleneck is probably not the search algorithm but the sort. – user3813674 Jun 23 '16 at 17:23
  • You can always try using a dictionary - dict lookups are very efficient since they hash instead of checking for string matches (which are really bad since URLs will usually start with several of the same characters). A hash search will be quicker since string comparison is slow anyway. – Delioth Jun 23 '16 at 17:25
  • If you have enough memory, you can hash the url then pickle it – galaxyan Jun 23 '16 at 17:25
  • Binary search has failed miserably, especially because most of the URL remains the same (e.g. https, www and domain name) – Dr. Sahib Jun 23 '16 at 17:26
  • @galaxyan what purpose does the pickling fill? – Delioth Jun 23 '16 at 17:27
  • 100k strings? That's nothing. But why are you using a list instead of a set? – Stefan Pochmann Jun 23 '16 at 17:28
  • @Delioth every time the code finishes running, the list needs to be stored somewhere – galaxyan Jun 23 '16 at 17:28
  • @StefanPochmann It's a binary-search list (sorted) – Dr. Sahib Jun 23 '16 at 17:30
  • @Dr.Sahib But why? That just seems silly. – Stefan Pochmann Jun 23 '16 at 17:31
  • @StefanPochmann It's silly, that's why I am asking a question on Stack Overflow – Dr. Sahib Jun 23 '16 at 17:37
  • have a look at the experiments I did with lookup speeds: http://stackoverflow.com/questions/36392583/is-an-unordered-map-really-faster-than-a-map-in-practice TL;DR - use an unordered_map, every time. – Richard Hodges Jun 23 '16 at 17:41

2 Answers

3

You have to decide at some point just how large you want your crawled list to become. Up to a few tens of millions of items, you can probably just store the URLs in a hash map or dictionary, which gives you O(1) lookup.
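For example (an illustrative sketch only; the names are made up), the simplest version of that in Python is a plain set:

```python
# A plain set already gives average O(1) membership tests,
# versus O(log n) for binary search over a sorted list.
crawled = set()

url = "https://example.com/page"   # hypothetical URL
if url not in crawled:             # average O(1) lookup
    crawled.add(url)
    # ... fetch and parse the page ...
```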

In any case, with an average URL length of about 80 characters (that was my experience five years ago when I was running a distributed crawler), you're only going to get about 10 million URLs per gigabyte. So you have to start thinking about either compressing the data or allowing re-crawl after some amount of time. If you're only adding 100,000 URLs per day, then it would take you 100 days to crawl 10 million URLs. That's probably enough time to allow re-crawl.
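The back-of-envelope arithmetic behind those numbers (the figures are the assumptions stated above, not measurements):

```python
avg_url_bytes = 80                       # assumed average URL length
urls_per_gb = 10**9 // avg_url_bytes     # ~12.5 million raw strings per GB;
                                         # per-object overhead pulls that toward ~10 million
new_urls_per_day = 100_000               # upper end of the stated growth rate
days_to_ten_million = 10_000_000 // new_urls_per_day   # 100 days
```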

If those are your limitations, then I would suggest a simple dictionary or hash map that's keyed by URL. The value should contain the last crawl date and any other information that you think is pertinent to keep. Limit that data structure to 10 million URLs. It'll probably eat up close to 2 GB of space, what with dictionary overhead and such.
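A minimal sketch of that structure (the names and the re-crawl window are illustrative, not prescriptive):

```python
from datetime import datetime, timedelta

RECRAWL_AFTER = timedelta(days=100)   # assumed re-crawl window

crawl_log = {}   # url -> datetime of last crawl (plus whatever else you keep)

def should_crawl(url, now=None):
    now = now or datetime.utcnow()
    last = crawl_log.get(url)
    # crawl if we have never seen it, or it is older than the re-crawl window
    return last is None or now - last > RECRAWL_AFTER

def mark_crawled(url, now=None):
    crawl_log[url] = now or datetime.utcnow()
```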

You will have to prune it periodically. My suggestion would be to have a timer that runs once per day and cleans out any URLs that were crawled more than X days ago. In this case, you'd probably set X to 100. That gives you 100 days of 100,000 URLs per day.
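The daily prune could then be as simple as this (continuing the sketch above; crawl_log and the 100-day window are assumptions):

```python
from datetime import datetime, timedelta

def prune(crawl_log, max_age=timedelta(days=100), now=None):
    """Drop entries whose last crawl is older than max_age; run e.g. once a day."""
    now = now or datetime.utcnow()
    stale = [url for url, last in crawl_log.items() if now - last > max_age]
    for url in stale:
        del crawl_log[url]
```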

If you start talking about high capacity crawlers that do millions of URLs per day, then you get into much more involved data structures and creative ways to manage the complexity. But from the tone of your question, that's not what you're interested in.

Jim Mischel
  • Question says *"grows **to** about 20k to 100k daily"*, though, not *"grows by"*. – Stefan Pochmann Jun 23 '16 at 17:34
  • @StefanPochmann: I initially thought the meaning was "grows by", probably because I couldn't understand why one would worry about a number as small as 100K. It could be that I misunderstood and that the OP really was asking how to store a list of up to 100K URLs and efficiently search it. – Jim Mischel Jun 23 '16 at 18:03
  • @JimMischel What I actually meant is that we crawl hundreds of big websites (eBay etc.); sometimes we get 20k new pages, sometimes as many as 100k. These new pages, after being crawled, are added to the list. So the correct wording would be "grows by 20k to 100k". – Dr. Sahib Jun 23 '16 at 19:13
-1

I would suggest hashing your values before putting them into your binary-searched list - this gets rid of the probable bottleneck of string comparisons, swapping them for integer equality checks, while still keeping the O(log2(n)) binary search time. One caveat: you may not get consistent results from Python's built-in hash() between runs, since it is implementation-specific; within a single run it will be consistent. There's always the option to implement your own hash, which can be consistent between sessions as well.
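A rough sketch of that idea, using a stable hash from hashlib instead of the built-in hash() so the values survive between runs (function names are illustrative):

```python
import bisect
import hashlib

def stable_hash(url):
    # 64-bit integer digest that is consistent across runs and machines,
    # unlike Python's built-in hash(), which is randomized per process
    return int.from_bytes(hashlib.blake2b(url.encode(), digest_size=8).digest(), "big")

hashes = []   # sorted list of 64-bit hashes of crawled URLs

def seen(url):
    h = stable_hash(url)
    i = bisect.bisect_left(hashes, h)
    return i < len(hashes) and hashes[i] == h

def mark_seen(url):
    bisect.insort(hashes, stable_hash(url))
```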

Delioth
  • And there's the problem of hash collisions. You'll need a 64-bit hash. The number of collisions with a 32-bit hash would be terrible after just a few million URLs. – Jim Mischel Jun 23 '16 at 17:35