
What is the best way to unpack SequenceMatcher loop results in Python so that values can be easily accessed and processed?

from difflib import SequenceMatcher

orig = "1234567890"

commented = "123435456353453578901343154"

diff = SequenceMatcher(None, orig, commented)

match_id = []
for block in diff.get_matching_blocks():
    match_id.append(block)

print(match_id)

The digits in these strings stand in for Chinese characters.

The current iteration code stores match results in a list like this:

match_id
[Match(a=0, b=0, size=4), Match(a=4, b=7, size=2), Match(a=6, b=16, size=4), Match(a=10, b=27, size=0)]

I'd eventually like to mark out the comments with "{{" and "}}" like so:

"1234{{354}}56{{3534535}}7890{{1343154}}"

This means I need to unpack the SequenceMatcher results above and do some calculations on specific b and size values to yield this sequence:

rslt = [[0+4,7],[7+2,16],[16+4,27]]

That is, each pair has the form [b[i] + size[i], b[i+1]].

Sati

3 Answers


1. Unpacking SequenceMatcher results to yield a sequence

You can unzip match_id and then use a list comprehension with your expression.

a, b, size = zip(*match_id)
# a    = (0, 4,  6, 10)
# b    = (0, 7, 16, 27)
# size = (4, 2,  4,  0)

rslt = [[b[i] + size[i], b[i+1]] for i in range(len(match_id)-1)]
# rslt = [[4, 7], [9, 16], [20, 27]]

Reference for zip, a Python built-in function: https://docs.python.org/3/library/functions.html#zip
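As a variation on the index-based comprehension, here is a minimal sketch that pairs each block with its successor and reads the fields by name. It assumes the same orig and commented strings as in the question; the names blocks, cur, and nxt are only illustrative.

```python
from difflib import SequenceMatcher

orig = "1234567890"
commented = "123435456353453578901343154"

blocks = SequenceMatcher(None, orig, commented).get_matching_blocks()

# Pair each block with the one after it. The gap between two matches in
# `commented` runs from the end of one match (b + size) to the start of
# the next match (b).
rslt = [[cur.b + cur.size, nxt.b] for cur, nxt in zip(blocks, blocks[1:])]
print(rslt)  # [[4, 7], [9, 16], [20, 27]]
```

Because each Match is a named tuple, cur.b and cur.size avoid having to keep three parallel tuples in step.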

2. Marking out the comments with "{{" and "}}"

Loop through rslt, appending the matched text that precedes each gap and then wrapping the gap itself (the comment) in braces.

rslt_str = ""
prev_end = 0

for start, end in rslt:
    rslt_str += commented[prev_end:start]
    if start != end:
        rslt_str += "{{%s}}" % commented[start:end]
    prev_end = end
# rslt_str = "1234{{354}}56{{3534535}}7890{{1343154}}"
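
If this is needed in more than one place, the two steps could be folded into a single helper. The following is only a sketch under that assumption: mark_comments and its open_mark/close_mark parameters are hypothetical names, not part of difflib. It walks the matching blocks directly, so text that appears before the first match is also wrapped.

```python
from difflib import SequenceMatcher

def mark_comments(orig, commented, open_mark="{{", close_mark="}}"):
    """Wrap the parts of `commented` that do not match `orig` in markers."""
    blocks = SequenceMatcher(None, orig, commented).get_matching_blocks()
    pieces = []
    prev_end = 0  # end of the previous match within `commented`
    for block in blocks:
        # Anything between the previous match and this one is a comment.
        if block.b > prev_end:
            pieces.append(open_mark + commented[prev_end:block.b] + close_mark)
        # The matched text itself is copied unchanged.
        pieces.append(commented[block.b:block.b + block.size])
        prev_end = block.b + block.size
    return "".join(pieces)

print(mark_comments("1234567890", "123435456353453578901343154"))
# 1234{{354}}56{{3534535}}7890{{1343154}}
```

The final block returned by get_matching_blocks() is always a zero-size sentinel positioned at the end of both strings, which is what lets the loop pick up a trailing comment without special handling.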
aaron

I would do it like this:

from difflib import SequenceMatcher

orig = "1234567890"
commented = "123435456353453578901343154"

diff = SequenceMatcher(None, orig, commented)

match_id = []
rslt_str = ""
for block in diff.get_matching_blocks():
    match_id.append(block)

temp = 0
for i, m in enumerate(match_id[:-1]):
    rslt_str += commented[temp:m.b + m.size] + "{{"
    rslt_str += commented[m.b + m.size: match_id[i+1].b] + "}}"
    temp = match_id[i+1].b

so that rslt_str == "1234{{354}}56{{3534535}}7890{{1343154}}"

man zet
  • I like your code for being concise, but guess it is much harder for a beginner to understand and follow. – Sati Dec 25 '19 at 08:25

You can try this:

from difflib import SequenceMatcher

orig = "1234567890"
commented = "123435456353453578901343154"
diff = SequenceMatcher(None, orig, commented)

a, b, size = zip(*diff.get_matching_blocks())

start = {x + y : '{{' for x, y in zip(b[:-1],size)}
end = dict.fromkeys(b[1:], '}}')
rslt = {**start, **end}

final_str = ''.join(rslt.get(ix,'') + n for ix, n in enumerate(commented)) + '}}'

print(final_str)

Output:

1234{{354}}56{{3534535}}7890{{1343154}}

Explanation:

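Because get_matching_blocks() returns a list of Match named tuples, zip(*...) unpacks it directly into a, b, and size.

  1. Create a dictionary with each comment's starting index (b + size of the preceding match) as key and {{ as value.
  2. Similarly, create a dictionary with each comment's ending index (b of the following match) as key and }} as value.
  3. Merge both dictionaries into rslt.

Then walk through the characters of commented: for each index, dict.get returns the brace stored in rslt (or an empty string if there is none), and that brace is prepended to the character. Finally, join the pieces into a string. The trailing '}}' is appended by hand because the last ending index, 27, equals len(commented) and is never reached by enumerate.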
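
To make the intermediate values concrete, here is a short sketch that rebuilds the lookup dictionary from the answer's code and prints it for the example strings. Nothing new is introduced apart from the print call and the comments.

```python
from difflib import SequenceMatcher

orig = "1234567890"
commented = "123435456353453578901343154"

a, b, size = zip(*SequenceMatcher(None, orig, commented).get_matching_blocks())

# Opening braces go where each match ends; closing braces go where the
# next match begins.
start = {x + y: '{{' for x, y in zip(b[:-1], size)}
end = dict.fromkeys(b[1:], '}}')
rslt = {**start, **end}

print(rslt)
# {4: '{{', 9: '{{', 20: '{{', 7: '}}', 16: '}}', 27: '}}'}
```

Note that key 27 is equal to len(commented), so no character of commented carries that index.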

Sayandip Dutta