
I currently have a dictionary of nested dictionaries with approximately 2 million entries. The dictionary looks like this (a fake example):

{ "item 1" : { "name" : "John Doe", "address" : "#1 city, zip, whatever"},
  "item 2" : { "name" : "Jane Doe", "address" : "#2 city, zip, blablabla"},
 ...}

My task is to get the first n items (say n = 10) whose nested "address" field contains a given string. This must be very efficient, responding in milliseconds on a powerful desktop. I tried a loop with iterators and exception handling, but it was too slow. A dict comprehension iterates over every element, so it is also slow. Then I created an index dictionary where the key is the address and the value is a list of items (keys of the original dict), and iterated through it, stopping after n items. Something like this (dict_2):

{"#1 city, zip, whatever" : ["item 1", "item 5487", ...],
 "#2 city, zip, blablabla" : ["item 2", "item 1654654", ...] }

result = []
for k, v in dict_2.items():
    if findThis in k:
        result.extend(v)
        if len(result) >= n:
            break

Quite OK, but I still need some improvement, because Python loops are not as fast as I need. A comprehension does not break after the n-th match.

I can accept any kind of solution (series, list, dict, hashmap, etc.), but the goals are: response time as low as possible, and the result is a list of keys of the original dictionary.

Thank you in advance!

spyder
  • What is the order of magnitude that `n` is on? What's the current running time and what do you need to get it down to? – IanQ Jan 24 '19 at 20:25
  • Did you try to dump the dictionary into a string and regex it? – Shirkan Jan 24 '19 at 20:36
  • When you say "first 10 items" for example, do you mean we also need to sort them? Sort by the key, looks like? This really sounds like a great use of a database, if you're talking 2 million rows, and perform indexed searches. – Kenny Ostrom Jan 24 '19 at 21:23
  • @Ian Quah: the length of the index dictionary is approx 1-1.5 million items. Basically this is the search API in a webapp where any keystroke invokes the API (like the Google search box). Currently used by a few people in parallel, but expecting a few hundred. The search time is around 150-250 ms for a single search, depending on the position of the matching items in the dict, if any. Due to parallel requests, low memory consumption is also important. Fortran, C++, Cython can also be options, but first I need something very efficient in Python whose algorithm can be compiled into a binary later. – spyder Jan 24 '19 at 21:23
  • The specific requirement "substring search" is very hard to do efficiently. If you can restrict that to searchable tokens (e.g. whole words) you can index. Otherwise, we're probably talking huge suffix trie? – Kenny Ostrom Jan 24 '19 at 21:27
  • @Kenny Ostrom: no, no need to sort, just give the first 10 if they exist, or whatever you got. It is a search feature to help the user find something close to what she/he is looking for without knowing the exact item. It can be an address, book title, anything, but basically there are very few completely identical items. Parts of them can be similar hundreds of times (i.e. New York may appear in several items). I thought about that, but New will be in thousands of items, while New York in hundreds, and I am not sure whether mixing the results of New and York is efficient. Or not sure how to do this right now. – spyder Jan 24 '19 at 21:27
  • You mean find any 10. I heard "first" and think that means they are in some order. – Kenny Ostrom Jan 24 '19 at 21:29
  • Nope. First means... 10 items the algo finds while processing; then it stops and pushes the result back, so there is no need to process the entire dataset for all matches and then slice it. – spyder Jan 24 '19 at 21:30
  • @Shirkan: not yet, but will see. So you meant something like this: "#1 address, etc [item1, item x, item y], #2 address, blablabla [item a, item b],..." and to find "findThis"[^[]*\[.*\]", then process the list of items in []? – spyder Jan 24 '19 at 21:36
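The whole-word indexing idea raised in the comments could look like the following sketch; the sample data and field names are hypothetical, in the shape of the original dictionary:

```python
from collections import defaultdict

# Hypothetical sample data in the shape of the original dictionary
data = {"item 1": {"name": "John Doe", "address": "New York, 10001"},
        "item 2": {"name": "Jane Doe", "address": "York Road, London"},
        "item 3": {"name": "Jim Doe", "address": "Boston, 02101"}}

# Build once: lowercase word token -> keys of the original dict.
index = defaultdict(list)
for key, record in data.items():
    for token in record["address"].replace(",", " ").lower().split():
        index[token].append(key)

# A whole-word lookup is then a single hash access instead of a linear scan.
hits = index["york"]
print(hits)  # ['item 1', 'item 2']
```

This only handles exact tokens, not arbitrary substrings; multi-word queries like "New York" would need the candidate lists intersected.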

1 Answer


According to this answer, I found this link to a module called suffix_trees. From the description there:

A suffix tree is a useful data structure for doing very powerful searches on text strings. For example, it's probably possible to design a Python dictionary interface that accepts substrings of keys, and return a list of possible keys. Very very cool stuff. (I wonder if this is what Perl's study function does.) SuffixTree is a wrapper that allows Python programmers to play with suffix trees.

From what I'm seeing it doesn't subclass dict, so you would need to re-iterate over the dict to create the SubstringDict; then it should be faster to search substrings within the dict keys.

Something like the following

from SubstringDict import SubstringDict

dict_2 = {"#1 city, zip, whatever" : ["item 1", "item 5487", ...],
          "#2 city, zip, blablabla" : ["item 2", "item 1654654", ...]}

def make_sub_str_dict(src):
    # Build a SubstringDict once, mapping each address to its list of keys.
    d = SubstringDict()
    for address, items in src.items():
        d[address] = items
    return d

def find_substring(sub_str_dict, substring, n):
    # Yield matching items lazily and stop after n of them.
    found = 0
    for match in sub_str_dict[substring]:
        yield match
        found += 1
        if found >= n:
            return

dict_2_search = make_sub_str_dict(dict_2)

# now you can search for a substring
print(next(find_substring(dict_2_search, 'whatever you are looking for', 10)))

Note: I have not tested this code; it's a mockup, as I'm away from my workstation. So please be sure to check.

Dharman
Jab
  • Thanks! Will check in the office tomorrow :-) – spyder Jan 24 '19 at 21:40
  • No problem. You could possibly even edit the module I attached and make it a `dict` subclass so you don't need to iterate over the original dict – Jab Jan 24 '19 at 21:42
  • Just had some time to play with it. I chose a C++ implementation of suffix trees over a pure Python lib from a performance point of view. This is a bit different from your sample above, so I still need some time to test. Also, I need to take care of memory consumption. So, will see. – spyder Jan 28 '19 at 13:21
  • You have my interest, could you post a link to the C++ lib you're using? – Jab Jan 28 '19 at 20:43