I want to find the dictionary words contained in a given string. The strings are domain names. I have approximately 5,000 domain names and a dictionary of 60,000 words to check against. Checking every word against every domain means 60,000 checks per domain, or roughly 300,000,000 substring checks in total, which is far too slow.
Is there a smarter way to solve this problem and still get the words present in each string?
I've tried a simple nested loop (below), but I suspect the sheer number of checks calls for a smarter approach.
import pandas as pd

# Sample data; the real lists hold about 60,000 words and 5,000 domains
dictionary_of_words = ["I", "Stack", "overflow", "like"]  # etc.
AllDomains = pd.DataFrame({"domain": ["stackoverflow.com", "iLikeStackoverflow.com"]})  # etc.

def in_dictionary(AllDomains):
    # Add the result columns
    AllDomains["dictionary"] = False
    AllDomains["words"] = None
    # Lowercase once and drop the TLD. str.strip(".nl") would strip the
    # characters ".", "n" and "l" from both ends, not the ".nl" suffix.
    names = AllDomains["domain"].str.lower().str.replace(r"\.[a-z]+$", "", regex=True)
    # Lowercase the words too, otherwise "Stack" never matches "stack"
    words = [word.lower() for word in dictionary_of_words]
    for idx, name in names.items():
        # Check whether the entire name is a dictionary word
        if name in words:
            # .at avoids the chained assignment of ["col"].iloc[i] = ...
            AllDomains.at[idx, "dictionary"] = True
            print(AllDomains.at[idx, "domain"])
        # Otherwise collect every dictionary word contained in the name
        else:
            found = [word for word in words if word in name]
            if found:
                AllDomains.at[idx, "words"] = ", ".join(found)
                print(AllDomains.at[idx, "domain"])
    return AllDomains
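
One possible direction, sketched here under assumptions rather than as a definitive fix: instead of testing all 60,000 words against each domain, enumerate the substrings of each domain name and look each one up in a set. A name of length n has n(n+1)/2 substrings, so a 20-character name needs about 210 set lookups instead of 60,000 substring scans. The names `words_in_name`, `vocabulary` and `min_len` are made up for this illustration, and the sample data stands in for the real lists.

```python
def words_in_name(name, vocabulary, min_len=1):
    """Return the vocabulary words that occur as substrings of name,
    in order of first appearance. vocabulary must hold lowercase words."""
    name = name.lower()
    found = []
    for start in range(len(name)):
        for end in range(start + min_len, len(name) + 1):
            piece = name[start:end]
            # Set membership is a hash lookup, independent of dictionary size
            if piece in vocabulary and piece not in found:
                found.append(piece)
    return found


# Hypothetical sample data standing in for the real 60,000-word list
vocabulary = {w.lower() for w in ["I", "Stack", "overflow", "like"]}

for domain in ["stackoverflow.com", "iLikeStackoverflow.com"]:
    name = domain.rsplit(".", 1)[0]  # drop the TLD
    print(domain, words_in_name(name, vocabulary))
```

Raising `min_len` to 2 or 3 filters out very short words such as "i", which otherwise match inside many unrelated names.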