
My problem

I have two lists, 'predicted' and 'reference'. Each list contains strings: the first holds the elements output by my model, and the second holds the gold standard. I want to build an automatic error classifier, but I can't figure out how to compare each character within each string within each list. I can compare word-wise (code included below), but I want to look character by character.

Below is the code for my word-wise comparer, along with the lists of data I'm working with. NB: outside of this toy example, these lists are about 3000 items long.

predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']

########### word-wise finder ##############
# Set difference: strings in predicted that do not appear anywhere in reference
p = set(predicted)
r = set(reference)
errors = p - r

print(errors)

My code above returns me:

'r * a k t\n', 'd * o u l\n', 'p * i . v @ p\n'

My dream would be to have a returned list that looks like this:

['* a', 'l', 'p * i', 'p']

because I can then look at each element and classify the mistake that was made. Any advice is appreciated.

  • https://stackoverflow.com/questions/18454570/how-can-i-subtract-two-strings-in-python – clubby789 Aug 05 '19 at 14:06
  • I don't see how that linked question resolves mine, I'm not looking to remove matching list elements, but rather to look at each element within a list and then return the elements which do not match, could you explain if I'm missing something? – MadDanWithABox Aug 05 '19 at 14:13
  • If you convert each string in the list into a nested list, removing the matching elements will leave you with the non-matching elements. – clubby789 Aug 05 '19 at 14:14

1 Answer


My best guess is that you are looking for a character by character diff of each pair of words.

Assuming that you're looking for a minimal difference and the order of the characters matters, https://docs.python.org/3/library/difflib.html provides a SequenceMatcher that implements the right algorithm. Its output is a little confusing.

import difflib
print(difflib.SequenceMatcher(a='r * a k t\n', b='r A k t\n').get_opcodes())
# printed: [('equal', 0, 2, 0, 2), ('replace', 2, 5, 2, 3), ('equal', 5, 10, 3, 8)]

This literally means that the characters in range(0, 2) == [0, 1] are the same in both strings. That is, 'r ' matches.

Then the characters in range(2, 5) == [2, 3, 4] in the first string have to be replaced by the characters in range(2,3) == [2] in the second string. So '* a' gets replaced with 'A'.

And then the characters in range(5, 10) == [5, 6, 7, 8, 9] in the first string match the characters in range(3, 8) == [3, 4, 5, 6, 7] in the second string. In other words ' k t\n' matches.
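The index ranges are easier to read once they are turned back into substrings. The following is a small sketch of that decoding for the same pair of strings; the variable names (`matcher`, `decoded`) are only illustrative.

```python
import difflib

a = 'r * a k t\n'
b = 'r A k t\n'

matcher = difflib.SequenceMatcher(a=a, b=b)

# Each opcode is (tag, i1, i2, j1, j2): a[i1:i2] relates to b[j1:j2].
decoded = [(tag, a[i1:i2], b[j1:j2])
           for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

for row in decoded:
    print(row)
```

For this pair the middle row should read ('replace', '* a', 'A'), which is the substitution described above.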

For the format that you seem to be looking for (stuff in the first list not in the second), it suffices to look for only opcodes replace and delete. The other two opcodes are equal and insert.
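A minimal sketch of that filtering, applied to the lists from the question, might look like the following. It assumes the two lists are aligned, so that predicted[i] is compared with reference[i]; the function name `char_errors` is made up for illustration and is not part of difflib.

```python
import difflib

def char_errors(predicted, reference):
    """Collect the pieces of each predicted string that the matcher
    marks as replaced or deleted relative to its reference string."""
    errors = []
    # Assumption: the lists are aligned item by item.
    for pred, ref in zip(predicted, reference):
        matcher = difflib.SequenceMatcher(a=pred, b=ref, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ('replace', 'delete'):
                errors.append(pred[i1:i2])
    return errors

predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']

print(char_errors(predicted, reference))
```

On this data the result should be ['* a', 'l', 'p', 'i', 'p']. The last word yields 'p' and 'i' separately rather than 'p * i', because the matcher treats the ' * ' between them as matching text. If that grouping matters, neighbouring differences would need to be merged in a further step.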

btilly