-5

Let's say I have two string lists in Python (but this problem is not really language-specific):

a = ["cat", "dog", "fish"]
b = ["cat", "dog", "fish"]

My goal is to quantify the difference between those two lists. More specifically, my program has to calculate how similar list 1 is to list 2 and give that a score. I use this to measure the error in some results I get: I process some audio and get a list as the result, and I want to compare that list to the result I should have gotten.

Therefore, in the example above the result is identical to the correct result, so the answer should be 1 (100%).

In this case:

a = ["cat", "dog", "fish", "lion"]
b = ["cat", "dog", "fish", "tiger"]

The result is 0.75 (75%).

Here is my code:

def compare_lists(result, correct):
  # Positional comparison; only defined for lists of equal length.
  if len(result) != len(correct):
    return 0.0
  if not result:
    # Assumption: two empty lists count as identical (avoids dividing by zero).
    return 1.0
  matches = sum(1 for r, c in zip(result, correct) if r == c)
  return float(matches) / len(result)

However, problems arise when the lists have different lengths. For example:

a = ["cat", "dog", "zebra", "fish"]
b = ["dog", "zebra", "fish"]

The logic described above cannot be applied here. In this case, b is almost the same as a, but a has one extra element at the beginning. I want to quantify this similarity correctly: my current algorithm returns 0, but in reality my result does not differ much from the correct result.

pavlos163
  • `len(set(a).symmetric_difference(b))` is what you're looking for? And are you referring to *logic* as that of the code or the requirement? – Moses Koledoye May 15 '17 at 15:45
  • "Using the same logic as before, the result would be something like 3 or 4." In order to program any algorithm, you have to specify it. It's hard to help you with something when you don't know yourself exactly what it is. – pvg May 15 '17 at 15:46
  • What is the exact number you want to calculate? It's not at all clear from the question. You should provide some more test cases – Chris_Rands May 15 '17 at 15:46
  • Possible duplicate of [How can I iterate through two lists in parallel?](http://stackoverflow.com/questions/1663807/how-can-i-iterate-through-two-lists-in-parallel) and you want to use `itertools.zip_longest` – juanpa.arrivillaga May 15 '17 at 15:52
  • See [**`difflib`**](https://docs.python.org/2/library/difflib.html) – Peter Wood May 15 '17 at 15:54
  • Also, those are Python *lists* not arrays. – juanpa.arrivillaga May 15 '17 at 15:56
  • @Peter Wood: `difflib` seems like what I want. If I call the `ratio()` function for my two lists, will it take into account the similarity of the strings themselves? Because I don't want that. – pavlos163 May 15 '17 at 16:10
  • Why does this question have such a bad rating? There is no duplicate question and I think that after editing the question is clear. – pavlos163 May 15 '17 at 16:54

2 Answers

1

You can try this algorithm:

a = ["cat", "dog", "dog", "fish"]
b = ["dog", "dog", "fish"]

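if len(a) != len(b):
    # Join each list into one string and cut the joined b out of the joined a.
    # This is meaningful only when b occurs as one contiguous run inside a.
    results = ''.join(a).split(''.join(b))
    results = [i for i in results if i != '']  # leftover pieces of a
    print(results)
    print(len(results))

    print((len(a) - len(results)) / float(len(a)))

else:
    # Same length: fraction of positions where the elements match.
    score = sum(1 for i in range(len(b)) if b[i] == a[i]) / float(len(a))
    print(score)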

The goal of this algorithm is to determine by how much one list differs from the other. In this example, it finds that list b occurs inside list a with one extra element in front of it, so a differs from b by one element. The score is therefore 3/4 = 0.75, because three of the four elements of a also appear in b.
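Joining the strings loses the element boundaries, so for instance ["ab", "c"] and ["a", "bc"] look the same once joined. A minimal sketch of the same idea that works on the list elements directly is shown here. The name `contiguous_overlap_score` is hypothetical, and the sketch assumes the shorter list must appear as one contiguous run inside the longer one; it covers only the different-length idea, not positional matching.

```python
def contiguous_overlap_score(a, b):
    """Hypothetical helper: share of the longer list covered by the shorter
    one, if the shorter list occurs as a contiguous run inside the longer.
    Returns 0.0 when no such run exists."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    if not longer:
        # Assumption: two empty lists count as identical.
        return 1.0
    n = len(shorter)
    # Slide a window of the shorter list's length over the longer list.
    for start in range(len(longer) - n + 1):
        if longer[start:start + n] == shorter:
            return n / len(longer)
    return 0.0


a = ["cat", "dog", "dog", "fish"]
b = ["dog", "dog", "fish"]
print(contiguous_overlap_score(a, b))  # 3 of 4 elements covered: 0.75
```

Because the comparison is done element by element, lists whose joined strings happen to coincide are no longer treated as equal.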

Ajax1234
-1

difflib's SequenceMatcher does exactly what I wanted.

  import difflib

  s = difflib.SequenceMatcher(None, result, correct)
  s.ratio()  # similarity score between 0 and 1
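As a rough illustration, here is how that call behaves on the lists from the question. The wrapper name `list_similarity` is hypothetical. `ratio()` is documented as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. When the inputs are lists, the elements are compared for equality as whole items, so the similarity of the strings themselves is not taken into account.

```python
import difflib


def list_similarity(result, correct):
    # Hypothetical wrapper around difflib.SequenceMatcher for two lists.
    return difflib.SequenceMatcher(None, result, correct).ratio()


# Different lengths: 3 matched elements out of 7 in total, so 2*3/7 (about 0.857).
print(list_similarity(["cat", "dog", "zebra", "fish"],
                      ["dog", "zebra", "fish"]))

# Same length, one mismatch: 2*3/8 = 0.75, as in the question.
print(list_similarity(["cat", "dog", "fish", "lion"],
                      ["cat", "dog", "fish", "tiger"]))

# "lion" and "lions" are simply unequal elements: 2*1/4 = 0.5.
print(list_similarity(["cat", "lion"], ["cat", "lions"]))
```

Note that `ratio()` can depend on the order of the two arguments, so it is worth passing `result` and `correct` in a consistent order.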
pavlos163