2

I have a script which processes a list of URLs. The script may be called at any time with a fresh list of URLs. I want to avoid processing a URL which has already been processed at any time in the past.

At this point, all I want is to match each URL, which can be a really long string, against all previously processed URLs to ensure uniqueness.

My question is: how does an SQL query that matches a text URL against a MySQL database of nothing but URLs (say 40000 long text URLs) compare with my other idea of hashing the URLs and saving the hashes using, say, Python's shelve module?

shelf[hash(url)] = 1

Is shelve usable for a dictionary with 40000 string keys? What about with 40000 numerical keys with binary values? Any gotchas with choosing shelve over MySQL for this simple requirement?

Or, if I use a DB, is there a huge benefit to storing URL hashes in my MySQL DB instead of the URL strings?
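
To make the shelve idea concrete, here is roughly what I have in mind (a minimal sketch; the file path and function name are placeholders). Note that shelve keys must be strings, so I would store a hashlib hex digest rather than the result of the built-in hash(), which is an int (and, on Python 3, is not stable across interpreter runs):

import hashlib
import shelve

def seen_before(url, path='/tmp/processed.shelve'):
    # shelve keys must be strings, so use a hex digest instead of
    # hash(), which returns an int and varies between runs.
    key = hashlib.sha1(url.encode('utf-8')).hexdigest()
    with shelve.open(path) as sh:
        already = key in sh
        sh[key] = 1                # mark as processed
    return already

print(seen_before('http://www.example.com/'))  # False on the first call
print(seen_before('http://www.example.com/'))  # True from then on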

Pranab
  • I think your idea of saving the hash is very good and can boost performance a lot, but try to choose a hash algorithm with a low collision rate. – James.Xu Apr 03 '11 at 08:40
  • And the specific question is? – Apr 03 '11 at 08:42

3 Answers

2

Store the URLs in a set, which guarantees O(1) lookup, and shelve the set as a whole. With this number of URLs, storing and restoring it will take very little time and memory:

import shelve

# Write URLs to a shelve file
urls = ['http://www.airmagnet.com/', 'http://www.alcatel-lucent.com/',
        'http://www.ami.com/', 'http://www.apcc.com/', 'http://www.stk.com/',
        'http://www.apani.com/', 'http://www.apple.com/',
        'http://www.arcoide.com/', 'http://www.areca.com.tw/',
        'http://www.argus-systems.com/', 'http://www.ariba.com/',
        'http://www.asus.com.tw/']

s = set(urls)                        # store URLs as a set - lookup is O(1)
sh = shelve.open('/tmp/shelve.tmp')  # dump the set (as one unit) to a shelve file
sh['urls'] = s
sh.close()

sh = shelve.open('/tmp/shelve.tmp')  # retrieve the set from the file
s = sh['urls']
sh.close()
print('http://www.apple.com/' in s)  # True
print('http://matan.name/' in s)     # False

This approach is quite fast:

import random
import string
import shelve
import datetime


# Generate 40,000 random 50-character strings to stand in for URLs
urls = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(50))
        for i in range(40000)]
s = set(urls)

start = datetime.datetime.now()
sh = shelve.open('/tmp/test.shelve')
sh['urls'] = s                       # store the set (not the raw list)
sh.close()
end = datetime.datetime.now()
print(end - start)
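
For completeness, the whole check-and-update cycle could look like the sketch below; process() is a placeholder for your own processing step:

import shelve

def process(url):
    print('processing', url)       # placeholder for the real work

def process_new(new_urls, path='/tmp/shelve.tmp'):
    sh = shelve.open(path)
    seen = sh.get('urls', set())   # previously processed URLs; empty on first run
    for url in new_urls:
        if url not in seen:        # O(1) membership test on the set
            process(url)
            seen.add(url)
    sh['urls'] = seen              # write the updated set back as one unit
    sh.close()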
Adam Matan
  • A shelve is not suitable for a large amount of data, especially when you often need to read and write. Using a database, e.g. the ZODB, is a much better solution here. – Apr 03 '11 at 08:55
  • 40,000 strings are not big enough to need a database, in my opinion. – Adam Matan Apr 03 '11 at 09:02
  • Looking at [this answer](http://stackoverflow.com/questions/3525280/in-python-looking-for-an-alternative-to-shelve-too-slow-for-large-dictionaries) and your code, I think shelve should suffice for my purposes. – Pranab Apr 03 '11 at 09:18
1

Using a shelve is in general a bad idea for large amounts of data; a database is better suited when you have lots of data.

Options are:

  • ZODB (Python Object Database)
  • any RDBMS
  • noSQL world (like MongoDB, which is easily approachable and very fast)
0

Hashing is a good idea. To search strings, databases use indexes. Since a comparison operation can be defined on strings, it is possible to build an index that is a search tree and to process each query with logarithmic complexity.
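
For example, with the hash kept in an indexed column, the membership test becomes a single indexed lookup. A minimal sketch, using the standard library's sqlite3 as a stand-in for MySQL (the schema and queries carry over, with %s placeholders instead of ?):

import hashlib
import sqlite3

conn = sqlite3.connect(':memory:')   # stand-in for a MySQL connection
conn.execute('CREATE TABLE urls (hash CHAR(40) PRIMARY KEY, url TEXT)')

def is_new(conn, url):
    # Index the fixed-length digest rather than the long URL text;
    # the PRIMARY KEY index turns the lookup into a tree search.
    h = hashlib.sha1(url.encode('utf-8')).hexdigest()
    if conn.execute('SELECT 1 FROM urls WHERE hash = ?', (h,)).fetchone():
        return False
    conn.execute('INSERT INTO urls VALUES (?, ?)', (h, url))
    return True

print(is_new(conn, 'http://www.apple.com/'))  # True
print(is_new(conn, 'http://www.apple.com/'))  # False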

Andrey Sboev