There are a few problems with your code, and other points you could improve:
- instead of having six different variables each for verbs and output, use two lists; this way you can more easily adapt the "splitting points" or the number of sublists, and you do not have to copy-paste the code block for comparing the words six times; just use another loop (see the sketch after this list)
- the sublist `words[13081:9811]` is empty, as is any other sublist where the second index is smaller than the first
- with `verbs1 = words[:3269]` and `verbs2 = words[3270:13080]`, the word `words[3269]` will be in neither of the sublists, as the second index of a slice is already exclusive; the same happens at the boundaries of the following lists
- just in case this was your intention: splitting the list will not reduce the complexity or the running time, as you still have to compare each pair of words; `a*x + b*x + c*x` is the same as `(a+b+c) * x`
- instead of checking `a < b` and dismissing half of the Cartesian product, use `itertools.combinations` (but this only works when not splitting the list)
- if you are only interested in pairs with an edit distance `<= 5`, you could do a few cheaper checks first, like comparing the lengths of the two words, or the difference of the sets of characters they contain; both of those checks are faster than the actual edit distance calculation, which is O(n²), and might rule out many combinations (see the `might_be_close` sketch below)
- for the same reason, do not calculate the edit distance twice, once in the check and again when writing it to the file; calculate it once and store it in a temporary variable
- if you split the list so that the output file does not become too large for Excel to handle (as I understood from one of your comments), your approach might not work, as the size of the output file can vary drastically depending on how many matches there are in each sublist
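To illustrate the first few points: if you do keep the splitting approach, a minimal (untested) sketch of the loop could look like this. The split points and the `compare_sublist` helper are hypothetical placeholders; deriving the slices from consecutive split points avoids both the empty and the gapped sublists:

```python
# hypothetical split points instead of six hard-coded slices
split_points = [0, 3269, 13080, len(words)]

# consecutive pairs of split points give contiguous, non-overlapping slices
sublists = [words[lo:hi] for lo, hi in zip(split_points, split_points[1:])]

for i, sublist in enumerate(sublists):
    compare_sublist(sublist, i)  # placeholder for your comparison block
```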
Combining the above, you could try something like this (not tested):
```python
import csv
import itertools

import Levenshtein

path = '/Users/path/'
with open(path + 'wordlist.txt') as infile:
    words = set(s.strip() for s in infile)

combs = itertools.combinations(words, 2)
max_count = 10**6  # or whatever Excel can handle

for i, chunk in enumerate(chunks(combs, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")
        for a, b in chunk:
            if might_be_close(a, b, 5):
                d = Levenshtein.distance(a, b)
                if d <= 5:
                    writer.writerow([a, b, d])
```
Here, `chunks` is a function to split an iterator into chunks and `might_be_close` is a helper function comparing e.g. the lengths, or the sets of contained letters, as described above. The size of the output file might still vary, but will never exceed `max_count`.
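Neither helper is shown above, so here is one possible (untested) sketch of both, using `itertools.islice` for the chunking and the two cheap pre-checks mentioned earlier:

```python
import itertools

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk

def might_be_close(a, b, max_dist):
    """Cheap necessary conditions for Levenshtein.distance(a, b) <= max_dist."""
    # the edit distance is at least the difference in length ...
    if abs(len(a) - len(b)) > max_dist:
        return False
    # ... and at least half the number of characters occurring in only one
    # word, since one substitution can fix at most one character on each side
    if len(set(a) ^ set(b)) > 2 * max_dist:
        return False
    return True
```

Both checks are only lower-bound filters: they can never wrongly discard a pair, they just skip the expensive distance calculation when a pair cannot possibly be within `max_dist`.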
Or try this, to get output files with exactly `max_count` entries each (except possibly the last one):
```python
def filter_matches(combs, max_dist):
    # pre-filter the pairs, then yield only the real matches with their distance
    for a, b in combs:
        if might_be_close(a, b, max_dist):
            d = Levenshtein.distance(a, b)
            if d <= max_dist:
                yield a, b, d

max_count = 10**6  # or whatever Excel can handle
matches = filter_matches(itertools.combinations(words, 2), 5)

for i, chunk in enumerate(chunks(matches, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")
        for a, b, d in chunk:
            writer.writerow([a, b, d])
```
Here, the `filter_matches` generator pre-filters the combinations and we are chunking those filtered results to the right size.