3

So I want to determine the words in a given string. These strings are domain names. I have approximately 5000 domain names and a dictionary of 60000 dictionary words to check. This would result in checking 60000 times per domain, totalling to approximately 300.000.000 operations, which is just madness.

Therefore I would like to ask if there is a smarter way to solve this problem to still get the words present in the string.

I've tried to do it with a simple loop, but I guess this requires a smarter solution to work with the immense quantity of checks.

dictionary_of_words = ["I", "Stack", "overflow", "like", etc]
AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com", etc]

def in_dictionary(AllDomains):
    #Setting a new column
    AllDomains["dictionary"] = False
    AllDomains["words"] = None

    for i in range(len(AllDomains)):
        # Scan if the entire word is in the dictionary
        if AllDomains["domain"].str.strip(".nl").str.lower().iloc[i] in dictionary_of_words:
            AllDomains["dictionary"].iloc[i] = True
            print(AllDomains["domain"].iloc[i])

        # Scan which words there are in the domain
        else:
            for word in dictionary_of_words:
                print(word)
                if word in AllDomains["domain"].str.strip(".nl").str.lower().iloc[i]:
                    if AllDomains["words"].iloc[i] == None:
                        AllDomains["words"].iloc[i] = word
                    else:
                        AllDomains["words"].iloc[i] = AllDomains["words"].iloc[i] + f", {word}"

                    print(AllDomains["domain"].iloc[i])

    return AllDomains
Sil Bouwman
  • 101
  • 1
  • 8
  • take a look at the python dictionary data structure. – jdigital Sep 21 '19 at 21:35
  • Well, the algorithm for pattern matching is [Aho-Corasick](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) First the dictionary is pre-processed into a trie. And then you feed it the strings. – conditionalMethod Sep 21 '19 at 23:27

4 Answers4

2
  1. Put all the words in a set (not a list), so that you can very quickly check a string to see if it matches any word.
  2. Put all the lengths of words in a set.
  3. For each length, for each domain name, get all the substrings of that length, and check to see if it's in the word set.

Since you'll probably end up with about 10 lengths, and most domains will be less than 20 chars long, this will result in only couple hundred quick checks per domain name.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
1

I don't really see another solution, but you could simplify your code with:

dictionary_of_words = ["I", "Stack", "overflow", "like", 'etc']
AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com", 'etc']

# this is just to be sure they have the same case
dictionary_of_words = list(map(str.lower, dictionary_of_words))
AllDomains = list(map(str.lower, AllDomains))

for word in dictionary_of_words:
    matches = [domain for domain in AllDomains if word in domain]
    print(word, matches)

I have added the conversion to lower case just to have more matches with the sample data, not sure if this is necessary for you, but domain names are always lower case.

Also 600000 is a lot of words, are you sure that there are no duplicates there?

mucio
  • 7,014
  • 1
  • 21
  • 33
1

What you need is an algorithmic solution rather than a python based solution.

So I am suggesting an algorithm called Aho-Corasick string matching algorithm. Using this algorithm you can scan them all at once. (But there will be little overhead when preparation )

The algorithm searches a few patterns at the same time for a given string. In your case the algorithm search all dictionary_of_words for "stackoverflow.com". So we can quite easily use a for loop to search through all domains.

So this is a code I wrote for your requirement. Source for the algorithm


class AhoNode:
    def __init__(self):
        self.goto = {}
        self.out = []
        self.fail = None

def aho_create_forest(patterns):
    root = AhoNode()

    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.out.append(path)
    return root


def aho_create_statemachine(patterns):
    root = aho_create_forest(patterns)
    queue = []
    for node in root.goto.itervalues():
        queue.append(node)
        node.fail = root

    while len(queue) > 0:
        rnode = queue.pop(0)

        for key, unode in rnode.goto.iteritems():
            queue.append(unode)
            fnode = rnode.fail
            while fnode != None and not fnode.goto.has_key(key):
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode else root
            unode.out += unode.fail.out

    return root


def aho_find_all(s, root, callback):
    node = root

    for i in xrange(len(s)):
        while node != None and not node.goto.has_key(s[i]):
            node = node.fail
        if node == None:
            node = root
            continue
        node = node.goto[s[i]]
        for pattern in node.out:
            callback(i - len(pattern) + 1, pattern)

############################
# Demonstration of work
def on_occurence(pos, patterns):
    print ("At pos %s found pattern: %s" % (pos, patterns))



AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com"]
patterns = ["I", "Stack", "overflow", "like"]

root = aho_create_statemachine(patterns)
for i in AllDomains:
  print(i)
  aho_find_all(i, root, on_occurence)
  print ""

This gives results similar to

stackoverflow.com
At pos 5 found pattern: overflow

iLikeStackoverflow.com
At pos 5 found pattern: Stack
At pos 10 found pattern: overflow
  • Great! Thank you this sounds great. I will try it out tonight to see how long it takes me now – Sil Bouwman Sep 23 '19 at 11:08
  • Sure let me know your timing. I'm curious too. :-) – Tharaka Ratnayake Sep 23 '19 at 18:03
  • FYI, there are existing, C accelerated packages for Aho-Corasick out there, e.g. [`pyahocorasick`](https://pypi.python.org/pypi/pyahocorasick) that should perform better, and require less custom code for you to maintain. The code in this answer is Py2 only, among other issues. – ShadowRanger Sep 23 '19 at 18:53
  • I get this error while trying: File "Models/inDictionary.py", line 47, in aho_create_statemachine for node in root.goto.itervalues(): AttributeError: 'dict' object has no attribute 'itervalues' – Sil Bouwman Sep 24 '19 at 11:55
  • It's funtion in python 2.7. Please let me know if you are using python 3.6 – Tharaka Ratnayake Sep 24 '19 at 13:23
0

My solution is to mix all the domains into a large text soup. Then you just need to loop 600,000 times for the words. Hope it helps.

import re

dictionary_of_words = ["I", "Stack", "overflow", "like", "etc"]
AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com", "etc"]

domain_soup = " ".join(AllDomains).lower()

word_in_domain = {word: re.findall(r"[\S]*{}[\S]*".format(word.lower()), domain_soup) for word in dictionary_of_words}

Output:

{'I': ['ilikestackoverflow.com'],
 'Stack': ['stackoverflow.com', 'ilikestackoverflow.com'],
 'overflow': ['stackoverflow.com', 'ilikestackoverflow.com'],
 'like': ['ilikestackoverflow.com'],
 'etc': ['etc']}

Besides, you can consider using 'set' to delete duplicated words or domains.

Tyrion
  • 405
  • 8
  • 22