Crawler url queue or hash list?

Question

I'm re-writing the spidering/crawler portion of a Delphi 6 site mapper application that I previously wrote. The app spiders a single site.

I need to manage two aspects of this:

A Queue for URLs to scan, first in, first out.
A Scanned list of URLs so that links from a new page are not added to the queue if they were already visited. This list would need to be searched.

Previously these were done with a TList and a StringList respectively. Obviously the performance of these degraded on sites with thousands of links.

My question is, what should be used for these queues/list to ensure the best performance? I have little experience with hashes.

One site with thousands of pages? That's a big site! Are you re-writing in Delphi 6, or a more recent version of Delphi? If it's a more recent version, you might try TDictionary for #2, which I believe is faster. Others, better informed, will no doubt clarify. — RobertFrank, Jul 28 '11 at 13:24
Thousands of strings in a TStringlist does not sound very much - is the string list sorted (iirc this helps to reduce search time)? — mjn, Jul 28 '11 at 13:39
Oh, and maybe you should use threads to follow several links in parallel. The filter list of visited pages should be only one instance then (obviously). — mjn, Jul 28 '11 at 13:41
Does it really need to be FIFO queue for URLs? Because it severly limits the options or you have to make extra effort to keep track of the order of them. — ain, Jul 28 '11 at 14:33
The queue probably doesn't need to be FIFO. The new code will utilize threads, safely accessing the single list of queued urls and visited urls. Some of the sites that people try to site map have 40,000 pages (products and such). Yes, it's retarded to sitemap that, but they want to. LOL — MikeD, Jul 28 '11 at 15:57
I suggest you to study internals of `TMemIniFile` for THashedStringList — Premature Optimization, Jul 30 '11 at 12:23

score 5 · Answer 1 · edited May 23 '17 at 12:26

5

IMHO hashes will be the best candidate for such lists.

In Delphi 6, you can use the THashedStringList class available in the IniFiles unit. It will be faster than a TStringList.

Notice that if you use a sorted TStringList, you could use much faster binary search, fast enough for your purpose.

For something more complete, you may take a look at those OpenSource libraries:

TSynBigTableMetaData to store any amount data (in your case HTML pages) associated with metadata fields - you have indexes for metadata fields, so adding and retrieval will be quick;
A dynamic array, with the name using a hash can be used in Delphi 6 with TDynArrayHashed.

Update:

Just a trick for sorting URIs, if you use a sorted TStringList: you may better use a backward sort function, i.e. compare the URI text beginning from the end of the string instead of the beginning, since in URI, the change is in the suffix rather than in the prefix. You'll make the sort / binary search faster.

edited May 23 '17 at 12:26

Community

1
1

answered Jul 28 '11 at 14:56

Arnaud Bouchez

42,305
3
71
159

TSynBigTableMetaData looks to be quite interesting, but I could not find any documentation anywhere, just random code samples in the forum. – MikeD Jul 28 '11 at 15:53
1

@Jambog The unit is self-documented. All classes and methods have their exact description in the interface part. Take a look at those description, and you'll find your way out. You can use our SynProject tool to create a pdf documentation from the comments, if you need to. – Arnaud Bouchez Jul 28 '11 at 19:34
Arnaud, Synopse Big Table looks to be a really cool piece of code, however I think you're disillusioned if you think it's self documenting. As just one example, it as a Search function, but code samples in the forums show looping, and even they are cryptic with stuff like "for i := n-200 to n-1 do". – MikeD Jul 29 '11 at 10:15
Of course, general documentation is to be written. About the search methods, and source code sample, take a look at the benchmark and regression tests functions and methods available in the unit, in benchmark function TestBigTable and regression class TTestBigTable. Worth taking a look. The `n-200 to n-1` are just loop to test multiple times about any regression issue. – Arnaud Bouchez Jul 29 '11 at 15:36

score 3 · Answer 2 · answered Jul 28 '11 at 14:39

3

Trie's work great for storing large (unique) text blocks and retaining high speed searching. A while back I did a quick and dirty article for Pascal Gamer about them: http://www.pascalgamer.com/issue_details.php?i=1 that might be worth a read.

Basic concept is to create a record (class, whatever) that contains a letter or symbol and all its linked letters and symbols as children. These children are stored sorted so a quick binary search can be used to find the next node. When you reach the end of your input you can tell if your at a word end or valid position.

Nice thing about Trie's you can do partial matching, reverse lookups, skip searches, and etc without any problems. Down sides are; you can't have duplicate entries easily, they take more space on SMALLER data sets, and depending on your implementation case sensitive switching can be "interesting".

Use the concept day in and day out on millions of records with no issues and high speed retention.

answered Jul 28 '11 at 14:39

jdarling

2,438
2
14
10

hashes are usually faster than tries, because they are O(1) by design, if the hash function is well defined – Arnaud Bouchez Jul 28 '11 at 19:36
Depending on your goal and your hash function they can be faster. Of course, you still have the hash overhead to take into account. The advantage that I see of the Trie over a Hash is with regards to what I said above; Partial Matches, Reverse Lookups, Skip Searches, etc... IE where you want to do more than just match 1:1 (common in URL matching) – jdarling Jul 28 '11 at 20:16
You're perfectly right. But as you stated, the OP question was about just matching 1:1. In this case, hashes are faster than tries. About the hash function itself, I found http://www.strchr.com/hash_functions worth reading. And used an optimized pipelined crc32 as hash function in my libraries, when hashing some text. – Arnaud Bouchez Jul 29 '11 at 07:13
Thanks for the link, that's one I didn't have in my Diigo yet :). My main reference for Hash's is still Julian Bucknall's "The Tomes of Delphi: Algorithms and Data Structures" great book to have around even today. – jdarling Jul 29 '11 at 13:49
Julian's book is indeed the one to be read and understood. About hashing function, only a few were presented. – Arnaud Bouchez Jul 30 '11 at 06:25

Crawler url queue or hash list?

2 Answers2