2

I have a script which processes a list of URLs. The script may be called at any time with a fresh list of URLs. I want to avoid processing a URL which has already been processed at any time in the past.

At this point, all I want is to match each URL, which can be a really long string, against all previously processed URLs to ensure uniqueness.

My question is: how does an SQL query that matches a text URL against a MySQL database of nothing but URLs (say 40000 long text URLs) compare with my other idea of hashing the URLs and saving the hashes using, say, Python's shelve module?

shelf[hash(url)] = 1

Is shelve usable for a dictionary with 40000 string keys? What about with 40000 numerical keys with binary values? Any gotchas with choosing shelve over MySQL for this simple requirement?

Or, if I use a DB, is there a huge benefit to storing URL hashes in my MySQL DB instead of the URL strings?
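
To make the shelve idea concrete, here is roughly what I have in mind (a minimal sketch; the file path and function name are placeholders). Note that shelve keys must be strings, so I would store a hashlib hex digest rather than the result of the built-in hash(), which is an int (and, on Python 3, is not stable across interpreter runs):

import hashlib
import shelve

def seen_before(url, path='/tmp/processed.shelve'):
    # shelve keys must be strings, so use a hex digest instead of
    # hash(), which returns an int and varies between runs.
    key = hashlib.sha1(url.encode('utf-8')).hexdigest()
    with shelve.open(path) as sh:
        already = key in sh
        sh[key] = 1                # mark as processed
    return already

print(seen_before('http://www.example.com/'))  # False on the first call
print(seen_before('http://www.example.com/'))  # True from then on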

Pranab
  • I think your idea of saving the hash is very good and can boost performance a lot, but try to choose a hash algorithm with a low collision rate. – James.Xu Apr 03 '11 at 08:40
  • And the specific question is? – Apr 03 '11 at 08:42

3 Answers

2

Store the URLs in a set, which guarantees O(1) lookup, and shelve the set as a whole. With this number of URLs, storing and restoring it will take very little time and memory:

import shelve

# Write URLs to a shelve file
urls = ['http://www.airmagnet.com/', 'http://www.alcatel-lucent.com/',
        'http://www.ami.com/', 'http://www.apcc.com/', 'http://www.stk.com/',
        'http://www.apani.com/', 'http://www.apple.com/',
        'http://www.arcoide.com/', 'http://www.areca.com.tw/',
        'http://www.argus-systems.com/', 'http://www.ariba.com/',
        'http://www.asus.com.tw/']

s = set(urls)                        # store URLs as a set - lookup is O(1)
sh = shelve.open('/tmp/shelve.tmp')  # dump the set (as one unit) to a shelve file
sh['urls'] = s
sh.close()

sh = shelve.open('/tmp/shelve.tmp')  # retrieve the set from the file
s = sh['urls']
sh.close()
print('http://www.apple.com/' in s)  # True
print('http://matan.name/' in s)     # False

This approach is quite fast:

import random
import string
import shelve
import datetime


# Generate 40,000 random 50-character strings to stand in for URLs
urls = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(50))
        for i in range(40000)]
s = set(urls)

start = datetime.datetime.now()
sh = shelve.open('/tmp/test.shelve')
sh['urls'] = s                       # store the set (not the raw list)
sh.close()
end = datetime.datetime.now()
print(end - start)
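
For completeness, the whole check-and-update cycle could look like the sketch below; process() is a placeholder for your own processing step:

import shelve

def process(url):
    print('processing', url)       # placeholder for the real work

def process_new(new_urls, path='/tmp/shelve.tmp'):
    sh = shelve.open(path)
    seen = sh.get('urls', set())   # previously processed URLs; empty on first run
    for url in new_urls:
        if url not in seen:        # O(1) membership test on the set
            process(url)
            seen.add(url)
    sh['urls'] = seen              # write the updated set back as one unit
    sh.close()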
Adam Matan
  • A shelve is not suitable for a large amount of data, especially when you often need to read and write. Using a database, e.g. the ZODB, is a much better solution here. – Apr 03 '11 at 08:55
  • 40,000 strings are not big enough to need a database, in my opinion. – Adam Matan Apr 03 '11 at 09:02
  • Looking at [this answer](http://stackoverflow.com/questions/3525280/in-python-looking-for-an-alternative-to-shelve-too-slow-for-large-dictionaries) and your code, I think shelve should suffice for my purposes. – Pranab Apr 03 '11 at 09:18
1

Using a shelve is in general a bad idea for large amounts of data; a database is better suited when you have lots of data.

Options are:

  • ZODB (Python Object Database)
  • any RDBMS
  • noSQL world (like MongoDB, which is easily approachable and very fast)
0

Hashing is a good idea. To search strings, databases use indexes. Since a comparison operation can be defined on strings, it is possible to build an index that is a search tree and to process each query with logarithmic complexity.
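
For example, with the hash kept in an indexed column, the membership test becomes a single indexed lookup. A minimal sketch, using the standard library's sqlite3 as a stand-in for MySQL (the schema and queries carry over, with %s placeholders instead of ?):

import hashlib
import sqlite3

conn = sqlite3.connect(':memory:')   # stand-in for a MySQL connection
conn.execute('CREATE TABLE urls (hash CHAR(40) PRIMARY KEY, url TEXT)')

def is_new(conn, url):
    # Index the fixed-length digest rather than the long URL text;
    # the PRIMARY KEY index turns the lookup into a tree search.
    h = hashlib.sha1(url.encode('utf-8')).hexdigest()
    if conn.execute('SELECT 1 FROM urls WHERE hash = ?', (h,)).fetchone():
        return False
    conn.execute('INSERT INTO urls VALUES (?, ?)', (h, url))
    return True

print(is_new(conn, 'http://www.apple.com/'))  # True
print(is_new(conn, 'http://www.apple.com/'))  # False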

Andrey Sboev