In reference to this question of mine, I realized that the corpus is too big and needs to be split into multiple mini-lists before running the Levenshtein calculations. The following code is my simple attempt, but I was wondering if there's a more elegant way to do it:

import csv
import itertools, Levenshtein

# open the newline-separated list of words
path = '/Users/path/'
file = path + 'wordlist.txt'
output1 = path + 'ortho1.csv'
output2 = path + 'ortho2.csv'
output3 = path + 'ortho3.csv'
output4 = path + 'ortho4.csv'
output5 = path + 'ortho5.csv'
output6 = path + 'ortho6.csv'

words = sorted(set(s.strip() for s in open(file)))

# words is a list of 16349 words, so I split it into 6 mini-lists
verbs1 = words[:3269]
verbs2 = words[3269:13080]
verbs3 = words[13081:9811]
verbs4 = words[9812:6542]
verbs5 = words[6543:3273]
verbs6 = words[3274:len(words)]

For each of the above lists, I then run the following loop:

with open(output1, 'wb') as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    for a, b in itertools.product(verbs1, words):
        if (a < b and Levenshtein.distance(a,b) <= 5):
            writer.writerow([a, b, Levenshtein.distance(a,b)])

Again, everything works, but I was wondering if there's a more elegant way to write the loop once instead of repeating it for each mini-list.

RobertP.
  • `words[13081:9811]` Did you try this? Won't this just be an empty list, since the "to" index is lower than the "from" index? Also, you should probably use lists instead of six individual variables. – tobias_k Dec 09 '17 at 21:17
  • you can split the verbs into a dictionary of lists; your first code section is funky, the rest is fine. – user1767754 Dec 09 '17 at 21:19
  • Also, I don't see how splitting the list helps you at all, if you do the product for them all. `a*x + b*x + c*x` is the same as `(a+b+c) * x`. – tobias_k Dec 09 '17 at 21:19
  • @tobias_k: You're right! I messed up the indices. I wasn't completely clear in my question: Excel can't open the file as a single CSV, so I'm trying to split it into multiple CSVs. But I don't seem to succeed at all - any hint? – RobertP. Dec 09 '17 at 21:21
  • There are certainly many optimizations you could do if you are only interested in Levenshtein distances <= 5. Calculating the edit distance is O(n²), so before doing this, you could first check whether the difference in length is <= 5, and whether the difference in the sets of used letters is <= 5. Both checks will be much faster and can probably rule out many combinations before computing the edit distance. – tobias_k Dec 09 '17 at 21:24
  • Isn't that what my loop does? Not the length part - I mean my code just checks that the Levenshtein distance is <= 5, and then writes it to the CSV file. – RobertP. Dec 09 '17 at 21:26

2 Answers


Put the verbs in a list:

verbs = [words[:3269],words[3269:13080],words[13081:9811],words[9812:6542],
         words[6543:3273],words[3274:len(words)]]

Then use the length of that list to drive a single loop. The index lets us build the output path and access the matching element of verbs.

for i in range(len(verbs)):
    output = '{}ortho{}.csv'.format(path, i+1)
    with open(output, 'wb') as f:
        writer = csv.writer(f, delimiter=",", lineterminator="\n")
        for a, b in itertools.product(verbs[i], words):
            if (a < b and Levenshtein.distance(a,b) <= 5):
                writer.writerow([a, b, Levenshtein.distance(a,b)])
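
A slightly more idiomatic variant of the same loop (an untested sketch, not part of the original answer; it assumes the same verbs, path and words from above) uses enumerate instead of indexing by hand:

for i, sublist in enumerate(verbs, start=1):
    output = '{}ortho{}.csv'.format(path, i)
    with open(output, 'wb') as f:
        writer = csv.writer(f, delimiter=",", lineterminator="\n")
        for a, b in itertools.product(sublist, words):
            if a < b and Levenshtein.distance(a,b) <= 5:
                writer.writerow([a, b, Levenshtein.distance(a,b)])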
Anton vBR

There are a few problems with your code, and other points you could improve:

  • instead of having six different variables for verbs and output each, use two lists; this way you can more easily adapt the "splitting points" or number of sublists, and you do not have to copy-paste the code block for comparing the words six times; just use another loop
  • the sublist words[13081:9811] is empty, and also any others where the second index is smaller than the first
  • with verbs1 = words[:3269] and verbs2 = words[3269:13080], words[3269] will be in neither of the sublists, as the second index is exclusive; same for the following lists
  • just in case this was your intention, splitting the list will not reduce the complexity or the running time, as you still have to compare each of the words; a*x + b*x + c*x is the same as (a+b+c) * x
  • instead of checking a < b and dismissing half of the product, use combinations instead (but this only works when not splitting the list)
  • if you are only interested in pairs with an edit distance <= 5, you could do a few other checks first, like comparing the lengths of the two words, or the set difference of the contained characters; both of those checks will be faster than the actual edit distance check, which is O(n²), and might rule out many combinations
  • for the same reason, do not calculate the edit distance twice, once in the check and again when writing it to the file; calculate it just once and store it in a temporary variable
  • if you split the file so that the output file does not become too large for Excel to handle (as I understood one of your comments), your approach might not work, as the size of the output file can vary drastically, depending on how many matches there are in that sublist

Combining the above, you could try something like this (not tested):

import csv, itertools, Levenshtein

path = '/Users/path/'
with open(path + 'wordlist.txt') as infile:
    words = set(s.strip() for s in infile)

combs = itertools.combinations(words, 2)
max_count = 10**6 # or whatever Excel can handle
for i, chunk in enumerate(chunks(combs, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")   
        for a, b in chunk:
            if might_be_close(a, b, 5):
                d = Levenshtein.distance(a,b)
                if d <= 5:
                    writer.writerow([a, b, d])

Here, chunks is a function to split an iterator into chunks and might_be_close is a helper function comparing e.g. the length, or sets of contained letters, as described above. The size of the output file might still vary, but will never exceed max_count.
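
For reference, minimal sketches of those two helpers (hypothetical implementations, since the answer deliberately leaves them open; the set check relies on the fact that a single edit changes the symmetric difference of the two character sets by at most 2):

def chunks(iterable, size):
    # yield successive lists of at most `size` items from any iterable
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def might_be_close(a, b, max_dist):
    # cheap pre-checks that can only rule pairs out, never in
    if abs(len(a) - len(b)) > max_dist:
        return False  # length difference alone forces a larger edit distance
    if len(set(a) ^ set(b)) > 2 * max_dist:
        return False  # too many letters occur in only one of the words
    return True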

Or try this, to get output files with exactly max_count entries:

max_count = 10**6 # or whatever Excel can handle

def filter_matches(combs, max_dist):
    # pre-filter, then compute the distance only once per surviving pair
    for a, b in combs:
        if might_be_close(a, b, max_dist):
            d = Levenshtein.distance(a,b)
            if d <= max_dist:
                yield a, b, d

matches = filter_matches(itertools.combinations(words, 2), 5)
for i, chunk in enumerate(chunks(matches, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")
        for a, b, d in chunk:
            writer.writerow([a, b, d])

Here, the filter_matches generator pre-filters the combinations and we are chunking those to the right size.
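
As a tiny usage illustration (hypothetical words, assuming the helper sketches above):

words = {"walk", "walks", "walked", "jump"}
for a, b, d in filter_matches(itertools.combinations(sorted(words), 2), 5):
    print("%s %s %d" % (a, b, d))
# prints each qualifying pair with its distance, e.g. walk walks 1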

tobias_k
  • I understand, but I'd need some help here. How do I use `itertools.combinations(iterable, r)` here? – RobertP. Dec 09 '17 at 21:46
  • @RobertP. Well, `combinations` would only work if you do not split the list, as it only accepts one iterable and the number of elements per combination (2 in your case) – tobias_k Dec 09 '17 at 21:50
  • Right - that was why I was perplexed. So, I guess there's no way I can use `combinations` if I need to split up the list into multiple mini-lists (and that is, I think, the only way I can see all the data in Excel). – RobertP. Dec 09 '17 at 21:52
  • Thanks! I'll try this and get back to you right away. – RobertP. Dec 09 '17 at 22:08
  • For the first open, it gives me the following error: `TypeError: coercing to Unicode: need string or buffer, file found`. On the internet, this is usually connected to the fact that the code tries to use `open` twice. – RobertP. Dec 09 '17 at 22:29
  • it also needs a definition of `chunks` – RobertP. Dec 09 '17 at 22:33
  • Ok, I copied and pasted from the page you linked in your answer. Now another error: it says that there's no `ortho0.csv` in the directory. – RobertP. Dec 09 '17 at 22:38
  • I did it: `for i, chunk in enumerate(chunks(combs, max_count), start=1)`. – RobertP. Dec 09 '17 at 22:46
  • @RobertP. Did you read the answer? You have to provide that function yourself, doing the pre-checks I mentioned. Or just leave it out and check the edit distance for each combination. – tobias_k Dec 09 '17 at 22:48