I have a script that processes a list of URLs. The script may be called at any time with a fresh list of URLs, and I want to avoid processing a URL that has already been processed at any point in the past.
At this point, all I need to do is match each URL, which can be a really long string, against every previously processed URL to ensure uniqueness.
My question is: how does an SQL query that matches a text URL against a MySQL table containing only URLs (say 40000 long text URLs) compare with my other idea of hashing the URLs and saving the hashes using, say, Python's shelve module?
    shelf[str(hash(url))] = 1  # shelve keys must be strings
Is shelve usable for a dictionary with 40000 string keys? What about with 40000 numerical keys with binary values? Any gotchas with choosing shelve over MySQL for this simple requirement?
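For concreteness, the shelve approach I'm picturing is roughly this sketch (hashlib.sha1 and the filename are just placeholders for whatever hash and file I'd actually use):

    import hashlib
    import shelve

    def is_new_url(url):
        # Stable, string-valued key; shelve keys must be strings.
        key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        with shelve.open("processed_urls.shelf") as shelf:
            if key in shelf:
                return False      # seen in some earlier run
            shelf[key] = 1        # remember it for all future runs
            return True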
Or, if I use a DB, is there a big benefit to storing URL hashes in my MySQL DB instead of the URL strings themselves?
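For comparison, the MySQL version I have in mind is roughly this (table/column names, connection details, and the MySQLdb driver are placeholders; the unique column could hold either the raw URL or its hash):

    import MySQLdb  # any DB-API driver would look similar

    conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
    cur = conn.cursor()

    # One-time setup: the unique index is what enforces "never process twice".
    cur.execute("""
        CREATE TABLE IF NOT EXISTS processed_urls (
            url_hash CHAR(40) NOT NULL,
            url TEXT NOT NULL,
            UNIQUE KEY (url_hash)
        )
    """)

    def is_new_url(url, url_hash):
        # INSERT IGNORE silently skips rows whose url_hash already exists.
        cur.execute(
            "INSERT IGNORE INTO processed_urls (url_hash, url) VALUES (%s, %s)",
            (url_hash, url),
        )
        conn.commit()
        return cur.rowcount == 1   # 1 row inserted means the URL was new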