
I need information about any standard Python package which can be used for "longest prefix match" on URLs. I have gone through two packages, PyTrie (http://packages.python.org/PyTrie/#pytrie.StringTrie) and trie (http://pypi.python.org/pypi/trie/0.1.1), but they don't seem to be useful for the longest-prefix-match task on URLs.

For example, if my set has these URLs: 1 -> http://www.google.com/mail, 2 -> http://www.google.com/document, 3 -> http://www.facebook.com, etc.

Now if I search for 'http://www.google.com/doc' it should return 2, and a search for 'http://www.face' should return 3.
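
To make the expected behaviour concrete, here is a minimal sketch with a naive linear scan (the function and dict below are only mine to pin down the semantics, not from any package):

urls = {
    'http://www.google.com/mail': 1,
    'http://www.google.com/document': 2,
    'http://www.facebook.com': 3,
}

def longest_prefix_match(search, url_ids):
    # naive reference: collect the stored URLs that start with `search`
    # and return the id of the shortest one (any match would do here)
    matches = [(len(url), uid) for url, uid in url_ids.items()
               if url.startswith(search)]
    if not matches:
        return None
    return min(matches)[1]

print(longest_prefix_match('http://www.google.com/doc', urls))  # expected: 2
print(longest_prefix_match('http://www.face', urls))            # expected: 3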

I wanted to confirm whether there is any standard Python package which can help me do this, or whether I should implement a trie for prefix matching.

I am not looking for a regular-expression kind of solution since it is not scalable as the number of URLs increases.

Thanks a lot.

Amit

4 Answers


Performance comparison

suffixtree vs. pytrie vs. trie vs. datrie vs. startswith functions

Setup

The recorded time is the minimum time among 3 repetitions of 1000 searches. Trie construction time is included and spread across all searches. The search is performed on collections of hostnames containing from 1 to 1,000,000 items.
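
As a rough illustration only (not the actual benchmark code), such a measurement could be taken along these lines with the timeit module; build, search, and queries here are placeholders:

import timeit

def measure(build, search, queries, repeat=3, number=1000):
    # Build the data structure once and record its construction time,
    # then take the minimum over `repeat` runs of `number` query batches;
    # the one-off construction cost is added back so it is spread over
    # the searches, as described above.
    start = timeit.default_timer()
    index = build()                      # e.g. a trie built from the hostnames
    build_time = timeit.default_timer() - start
    timer = timeit.Timer(lambda: [search(index, q) for q in queries])
    return min(timer.repeat(repeat=repeat, number=number)) + build_time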

Three types of search string:

  • non_existent_key - there is no match for the string
  • rare_key - around 20 in a million
  • frequent_key - number of occurrences is comparable to the collection size

Results

Maximum memory consumption for a million urls:
| function    | memory, GiB | ratio (vs suffix_tree) |
|-------------+-------------+------------------------|
| suffix_tree |       0.853 |                    1.0 |
| pytrie      |       3.383 |                    4.0 |
| trie        |       3.803 |                    4.5 |
| datrie      |       0.194 |                    0.2 |
| startswith  |       0.069 |                    0.1 |

To reproduce the results, run the trie benchmark code.

  • rare_key/nonexistent_key case

    if the number of URLs is less than 10000 then datrie is the fastest; for N > 10000 suffixtree is faster; startswith is significantly slower on average.

(figure: rare_key timings)

  • axes:

    • vertical (time) scale is ~1 second (2**20 microseconds)
    • horizontal axis shows total number of urls in each case: N= 1, 10, 100, 1000, 10000, 100000, and 1000000 (a million).

(figure: nonexistent_key timings)

  • frequent_key

    Up to N=100000 datrie is the fastest (for a million URLs the time is dominated by the trie construction time).

    Most of the time is taken by finding the longest match among the found matches, so all functions behave similarly, as expected.

(figure: frequent_key timings)

startswith - time performance is independent of the type of key.

trie and pytrie behave similarly to each other.

Performance without trie construction time

  • datrie - the fastest, decent memory consumption

  • startswith is even more at a disadvantage here because the other approaches are not penalized by the time it takes to build a trie.

  • datrie, pytrie, trie - almost O(1) (constant time) for rare/non_existent key

(figures: rare_key and nonexistent_key timings without trie build time)

(figure: frequent_key timings without trie build time)

Fitting (approximating) polynomials of known functions for comparison (same log/log scale as in the figures; a sketch of how such fits can be computed follows the table):

| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |
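
These fits are straight lines in log2(time) versus log2(N) space. As a sketch only (not the code used for the table above), the slope and intercept of such a line could be obtained with numpy.polyfit, given arrays of collection sizes and measured times:

import numpy as np

def loglog_fit(sizes, times):
    # Fit log2(time) = a*log2(N) + b; a slope near 0 suggests roughly
    # constant time, ~0.5 suggests sqrt(N), ~1 linear, ~2 quadratic.
    a, b = np.polyfit(np.log2(sizes), np.log2(times), deg=1)
    return a, b
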
jfs
  • Excellent comparison. Can I ask what tools you used to measure the stats and to produce the charts? I wish I'd had the time and energy to implement radix tree so it could've been included in the comparison. – Stephen Paulger Mar 29 '11 at 22:02
  • @StephenPaulger: `timeit` module to measure stats and [make-figures.py](https://gist.github.com/235404) script to plot simple `*.xy` files (e.g., `pytrie-non_existent_key-7-1000000.xy`). The script originated at [Compare sorting algorithms' performance](http://rosettacode.org/wiki/Measure_relative_performance_of_sorting_algorithms_implementations) – jfs Mar 30 '11 at 00:58
  • @MikhailKorobov: I need `longest_match()` function for comparison. [Is it correct usage of datrie?](https://gist.github.com/3100522) – jfs Jul 12 '12 at 19:56
  • Yes, it is correct usage. I think trie should be created once though. Thanks! – Mikhail Korobov Jul 13 '12 at 11:56
  • 1
  • @MikhailKorobov: I've dumped [the code for the benchmark at github](https://github.com/zed/trie-benchmark). You can run it yourself. I'll see if I manage to debug the timing procedure (I don't understand the results), then I'll update the answer with new results. – jfs Jul 13 '12 at 15:12
  • 1
  • @MikhailKorobov: I've figured it out. The old results correspond to the "Performance without trie construction time" case above. I've added `datrie` to the comparison. – jfs Jul 13 '12 at 20:57
  • @J.F. Sebastian: thanks a lot! By the way, I think memory consumption for all trie packages may be a bit better: they don't store the keys as-is, so they don't need the hosts list if they support saving/loading from disk; the "startswith" approach, on the other hand, does need this list. It is also probably better to disable garbage collection while measuring execution time. – Mikhail Korobov Jul 14 '12 at 03:44

This example is good for small url lists but does not scale well.

def longest_prefix_match(search, urllist):
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    else:
        raise Exception("Not found")
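
For example, with the URL list from the question:

url_list = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix_match('http://www.google.com/doc', url_list))
# -> http://www.google.com/document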

An implementation using the trie module.

import trie


def longest_prefix_match(prefix_trie, search):
    # Returns the values of all stored URLs that extend `search`
    # (falls through to an empty list when there is no match).
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, longest_prefix_match(url_trie, search)

Result:

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

Or using PyTrie, which gives the same result but with the lists ordered differently:

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, url_trie.values(prefix=search)

I'm beginning to think a radix tree / patricia tree would be better from a memory usage point of view. This is what a radix tree would look like:

(figure: radix tree of the example URLs)

Whereas the trie looks more like this: (figure: trie of the example URLs)

Stephen Paulger
  • Thanks a lot for the reply, but I am not looking for a regular expression kind of solution since it is not scalable as the number of different URL's increase. What I am looking for is Trie based solution for longest prefix match where the strings are URL's. – Amit Mar 27 '11 at 03:34
  • Probably `return ''` is more appropriate here instead of `raise Exception` – jfs Mar 27 '11 at 08:01
  • @Amit fair enough. How are you storing your list of URLs? If you have them in a database indexing the URL would probably provide an easy and efficient solution. @J.F. Sebastian ok thanks. – Stephen Paulger Mar 27 '11 at 18:42
  • @Stephen I am not storing them in a database; I have a list of URLs, each with a unique random number associated with it. Now I would like to store them in a trie, then match a new URL and find the closest prefix match – Amit Mar 28 '11 at 06:37
  • +1 for the `pytrie` version. You could use `url_trie.keys` instead of `url_trie.values` [in that case the url_trie can be constructed as `url_trie = StringTrie.fromkeys(url_list)`]. – jfs Mar 28 '11 at 21:12
  • Yeah, I noticed that but @amit said that he has unique numbers for values, so thought I'd use values so he can replace the url value with his unique number. – Stephen Paulger Mar 28 '11 at 21:46
  • @Stephen thanks a lot Stephen and Sebastian. I explored this package but didn't know how to use the hidden method, so I thought it couldn't be used for my problem – Amit Mar 29 '11 at 08:50
  • I think the problem might have been that longest prefix match is almost the opposite of what you're doing. So if you had ['da', 'de', 'di', 'du'] in a trie, an example of LPM is searching for 'duck', which would return 'du'. – Stephen Paulger Mar 29 '11 at 08:56
  • I've added a performance comparison suffixtree vs. pytrie vs. trie vs. startswith -functions http://stackoverflow.com/questions/5434813/longest-prefix-matches-for-urls/5479374#5479374 – jfs Mar 29 '11 at 21:52
  • 1
  • [I've added a comparison with `datrie`. It is faster and consumes much less memory than `pytrie`.](http://stackoverflow.com/a/5479374/4279) – jfs Jul 13 '12 at 20:55

The function below will return the index of the longest match. Other useful information can easily be extracted as well.

from os.path import commonprefix as oscp

def longest_prefix(s, slist):
    # common prefix of the search string with each candidate, plus its index
    pfx_idx = ((oscp([s, url]), i) for i, url in enumerate(slist))
    # (length, prefix, index) triples so max() picks the longest common prefix
    len_pfx_idx = map(lambda t: (len(t[0]), t[0], t[1]), pfx_idx)
    length, pfx, idx = max(len_pfx_idx)
    return idx

slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix('http://www.google.com/doc', slist))
print(longest_prefix('http://www.face', slist))
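
As noted in the comments below, this returns an index even when nothing matches. A minimal variant (mine, not part of the original code) that reports a zero-length match explicitly:

def longest_prefix_or_none(s, slist):
    # Same idea as above, but return None when the best common prefix is empty.
    length, idx = max((len(oscp([s, url])), i) for i, url in enumerate(slist))
    return idx if length > 0 else None

print(longest_prefix_or_none('ftp://example.org', slist))  # -> None
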
Tom Zych
  • This returns an index for searches that should match nothing. – Stephen Paulger Mar 25 '11 at 17:08
  • Good point, it's not checking for the zero-length match. Didn't have time to polish it. – Tom Zych Mar 25 '11 at 17:11
  • Thanks a lot for the reply, but I am not looking for a regular expression kind of solution since it is not scalable as the number of different URL's increase. What I am looking for is Trie based solution for longest prefix match where the strings are URL's. – Amit Mar 27 '11 at 03:52

If you are willing to trade RAM for time performance then SuffixTree might be useful. It has nice algorithmic properties, such as allowing the longest common substring problem to be solved in linear time.

If you always search for a prefix rather than an arbitrary substring then you could add a unique prefix while populating SubstringDict():

from SuffixTree import SubstringDict

substr_dict = SubstringDict()
for url in URLS: # urls must be ascii (valid urls are)
    assert '\n' not in url
    substr_dict['\n'+url] = url #NOTE: assume that '\n' can't be in a url

def longest_match(url_prefix, _substr_dict=substr_dict):
    matches = _substr_dict['\n'+url_prefix]
    return max(matches, key=len) if matches else ''
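
For example, the URLS placeholder above could be the list from the question; the expected results under this scheme would then be (my expectation, not verified against the library):

URLS = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_match('http://www.google.com/doc'))  # expected: http://www.google.com/document
print(longest_match('http://www.face'))            # expected: http://www.facebook.com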

Such usage of SuffixTree seems suboptimal but it is 20-150 times faster (without SubstringDict()'s construction time) than @StephenPaulger's solution [which is based on .startswith()] on the data I've tried and it could be good enough.

To install SuffixTree, run:

pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees
jfs
  • This is a better solution than mine by far. It's a shame there are no tree implementations in the standard Python library. Can SuffixTrees be serialised, or is generating them so quick that it doesn't matter if you recreate them? – Stephen Paulger Mar 28 '11 at 09:35
  • @Sebastian This library is doing suffix matching and not prefix matching, which I require. I am not sure how I can use it for my task. If I build a suffix tree (st) using st['http://www.abcd.com/']=1, st['http://www.abcd.com/news']=2, st['http://www.abcd.com/sports']=3, and then search for 'http://www.abcd.com' by doing st['http://www.abcd.com'], I get {1,2,3} as the answer, while it should return only 1 as the prefix match – Amit Mar 28 '11 at 10:08
  • @StephenPaulger: `SuffixTree` is a simple SWIG wrapper for [strmat](http://www.cs.ucdavis.edu/~gusfield/strmat.html) C-library. I doubt that there is a serialization support but it certainly would be useful. `SuffixTree` solution doesn't make sense if you regenerate it from start for every new `url_prefix`. – jfs Mar 28 '11 at 10:25
  • @Amit: `SubstringDict` does *substring* matching. You could use a unique prefix such as `'\n'` as explicitly said in my answer. Use Markdown formatting (click the `help` under 'Add Comment') to avoid corrupting code fragments. – jfs Mar 28 '11 at 10:40
  • @Sebastian: Thanks for your help, but the method you have mentioned is failing for the "prefix" match – Amit Mar 28 '11 at 15:17
  • @Amit: Could you provide an example when it fails? – jfs Mar 28 '11 at 15:28
  • @Sebastian Please see the example below : >>> d2 = SuffixTree.SubstringDict() >>> a='\n'+'http://www.abcd.com/' >>> d2[a]=1 >>> a='\n'+'http://www.abcd' >>> d2[a]=2 >>> d2[a] >>>[1, 2] the answer I want is only 2 – Amit Mar 29 '11 at 04:50
  • @Amit: Use Markdown formatting (click the help under 'Add Comment') to avoid corrupting code fragments. Both `a` values start with `http://www.abcd` (have the same *prefix*) therefore the answer provided by `SubstringDict` is correct `[1, 2]`. – jfs Mar 29 '11 at 07:16