
We have a large database of partial URLs (strings) such as:

  • "example1.com"

  • "example2.com/test.js"

  • "/foo.js"

Our software listens for HTTP requests and tries to find one of our database's partial URLs in the HTTP request's full URL.

So we are getting full URLs (e.g. "http://www.example.com/blah.js?foo=bar") and trying to match one of our database's partial patterns against them.

Which data structure would be best to store our partial-URL database in if all we care about is search speed?


Right now, this is what we do:

  • Iterating through the entire database of partial URLs (strings) and using indexOf (in JavaScript) to check whether the full URL contains each partial string.
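For reference, that linear scan can be sketched roughly like this (`partials` is a hypothetical in-memory copy of the database, not actual code from the extension):

```javascript
// Hypothetical sketch of the current approach: a linear indexOf scan.
// Every partial string is checked against every incoming full URL,
// so the cost is O(number of partials * URL length) per request.
var partials = ["example1.com", "example2.com/test.js", "/foo.js"];

function matchesAny(fullUrl) {
  for (var i = 0; i < partials.length; i++) {
    if (fullUrl.indexOf(partials[i]) !== -1) {
      return true; // first hit wins
    }
  }
  return false;
}
```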

UPDATE:

This software is a Firefox extension written in JavaScript on Firefox's Add-on SDK.

josesigna
  • Is the number of partial urls' types fixed (as seen in the example, it's domain and path only)? I'd go with array of hashes in this case, checking each chunk of the analyzed URL against the corresponding hash separately. – raina77ow Aug 13 '13 at 21:16
  • Why don't you record the number of hits to each and sort the search by that? – Paul Aug 13 '13 at 21:18
  • Does the database fit in the RAM? – pkacprzak Aug 14 '13 at 09:28
  • Are your partial strings only domain names or page names (example.com or foo.js or bar.html)? – stan0 Aug 14 '13 at 11:29
  • @raina77ow: yes, the number of partial urls on our database are either sub-domain.domain, domain, sub-domain.domain/path, domain/path, path by itself. – josesigna Aug 19 '13 at 21:37
  • @Paul: that's a good idea but we see too many different HTTP requests for this improvement to make sense. – josesigna Aug 19 '13 at 21:41

2 Answers


Assuming that your partial strings are only domain names and/or page names, you could try to generate all possible combinations from the URL, starting from the end:

 http://www.example.com/blah.js?foo=bar
 blah.js
 example.com/blah.js
 www.example.com/blah.js

Then hash all the combinations, store them in an array and try to find any of them in another array that contains the hashes of all your partial strings from the database.
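A rough sketch of this idea in JavaScript, assuming the partials are limited to the host-suffix/path combinations described in the question. Plain strings in a Set stand in for the hashes here (a Set lookup hashes the key internally, so the lookup cost is comparable); the function names and the sample database are illustrative, not from the original answer:

```javascript
// Generate the candidate partials from a full URL, working from the end:
// the path alone, each host suffix, and each host suffix + path.
function candidates(fullUrl) {
  // Strip the scheme and any query string / fragment.
  var stripped = fullUrl.replace(/^[a-z]+:\/\//i, "").replace(/[?#].*$/, "");
  var slash = stripped.indexOf("/");
  var host = slash === -1 ? stripped : stripped.slice(0, slash);
  var path = slash === -1 ? "" : stripped.slice(slash);
  var labels = host.split(".");
  var out = [];
  if (path) out.push(path); // e.g. "/blah.js"
  for (var i = 0; i < labels.length - 1; i++) {
    var suffix = labels.slice(i).join("."); // "www.example.com", "example.com"
    out.push(suffix);
    if (path) out.push(suffix + path);    // "example.com/blah.js"
  }
  return out;
}

// Sample database; in practice this would be loaded once at startup.
var db = new Set(["example.com/blah.js", "/foo.js"]);

function matches(fullUrl) {
  return candidates(fullUrl).some(function (c) { return db.has(c); });
}
```

The number of candidates is small (a handful per URL), so each request costs a few constant-time lookups instead of a scan over the whole database.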

NOTE:

In case you want to match ANY string in the URL, like ample in example.com, it becomes a little complicated in terms of storage, because the number of combinations of strings in a URL is given by the combination formula

  C(n, k) = n! / (k! * (n - k)!)

where n is the length of the URL and k is the length of the string to find. According to this SO question the maximum reasonable length of a URL is 2000 characters. And assuming that you want to match a random string, you'd have k vary from 1 to 2000, which would result in a large number of hashes generated from the URL - the sum of C(2000, k) for each k from 1 to 2000, i.e. 2000! / (k! * (2000 - k)!) different hashes for each k.

stan0
  • our partial strings are not only domain names, they are sub-domain.domain, domain, sub-domain.domain/path, domain/path, path by itself. – josesigna Aug 19 '13 at 21:49
  • This is good. My point was that this method is worth it when you don't have to hash EVERY possible combination. In your case you only have to hash the path (/path/file.ext ?), domain/path, domain, sub-domain.domain and sub-domain.domain/path – stan0 Aug 20 '13 at 05:29

There are a few things you could do:

  • Don't process the URLs on the client side. JavaScript is going to be slow, especially if you have a lot of these URLs. You could create a REST API and pass the URL to match as a query parameter, e.g. domain.com/api/?url=.... Placing the heavy lifting and memory use on the server side also keeps the extension lightweight.
  • Bootstrap the URLs into RAM and don't read from the database every time. Something like memcached could work perfectly in this case.
  • Once in RAM, a hash-table structure would work best, since you are doing simple matching. Whatever you do, avoid linear string comparison.
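A minimal sketch of the in-memory hash-table idea, assuming the partial URLs can be loaded once into a Map (or a plain object) at startup. The candidate keys derived from each request (path, host suffixes, etc.) are assumed to be extracted elsewhere; each one then costs a single constant-time probe instead of a scan:

```javascript
// Bootstrap the partial URLs into an in-memory Map once at startup.
var db = new Map();
["example1.com", "example2.com/test.js", "/foo.js"].forEach(function (p) {
  db.set(p, true); // the value could hold metadata, e.g. a block/allow flag
});

// Check a list of candidate keys derived from the incoming request.
function lookup(candidateKeys) {
  return candidateKeys.some(function (k) { return db.has(k); });
}
```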

If you follow these few suggestions, you would have a significant speedup. Hope this helps.

Zorayr
  • The software is a browser extension so the processing is happening live on the client side for each HTTP request. So we cannot make use of the server side. Regarding the HashTable, how would you go about the fact that we have full urls incoming to match against a db of partial/incomplete urls? – josesigna Aug 19 '13 at 21:46
  • Doesn't your extension have access to the Internet? – Zorayr Aug 20 '13 at 20:12
  • Yes; however, our extension uses this "matching" system to determine whether or not to allow the HTTP request to be sent, and thus having to wait for a response from our server would make loading any website extremely slow. And this would happen for every HTTP request. – josesigna Aug 21 '13 at 15:05