I have pairs of strings, and for each pair I want to highlight and print the differences between the two strings (specifically in a Jupyter notebook). By differences, I mean the insertions, deletions and replacements needed to change one string into the other.

I found this question, which is similar but doesn't mention a way to present the changes.

Burhan

1 Answer

I figured out an effective way to display such highlighting and want to share it with others.

The difflib module, specifically its SequenceMatcher class, gives you the tools to find the differences, while the IPython.display module lets you render the highlights in a notebook.
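To make that concrete, here is a minimal sketch of what SequenceMatcher reports for a pair of strings. The two strings are illustrative assumptions, not part of the data used in this answer. Each opcode is a tuple (tag, i1, i2, j1, j2) saying how a[i1:i2] relates to b[j1:j2].

```python
from difflib import SequenceMatcher

# illustrative strings (an assumption for this sketch, not from the question)
a, b = 'qabxcd', 'abycdf'

# each opcode is (tag, i1, i2, j1, j2): a[i1:i2] corresponds to b[j1:j2]
opcodes = SequenceMatcher(None, a, b).get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(f'{tag:8} a[{i1}:{i2}]={a[i1:i2]!r}  b[{j1}:{j2}]={b[j1:j2]!r}')
```

The tag is one of 'equal', 'delete', 'insert' or 'replace', which are the cases the highlighting has to handle.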

Demonstration

First, let's assume the data is in the following format:

cases = [
    ('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),    
    ('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
    ('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
    ('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
    ('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
    ('abcdefg','xac')
]

You can create a function that gives you the HTML string which highlights the insertions, deletions and replacements, using the following code:

from difflib import SequenceMatcher

# highlight colors
# you may change these values according to your preferences
color_delete = '#811612'  # highlight color for deletions
color_insert = '#28862D'  # highlight color for insertions
color_replace = '#BABA26' # highlight color for replacements

# the common format string used for highlighted segments
f_str = '<span style="background: {};">{}</span>'

# given two strings (a, b), getFormattedDiff returns the HTML formatted strings (formatted_a, formatted_b)
def getFormattedDiff(a, b):
    # initialize the sequence matcher
    s = SequenceMatcher(None, a, b)

    # stringbuilders for the formatted strings
    formatted_a = []
    formatted_b = []

    # iterate through all char blocks
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == 'equal':
            # if the blocks are the same, append the block to both strings without any formatting
            formatted_a.append(a[i1:i2])
            formatted_b.append(b[j1:j2])
        elif tag == 'delete':
            # if this is a deletion block, append block to the first string with the delete highlight
            formatted_a.append(f_str.format(color_delete, a[i1:i2]))
        elif tag == 'insert':
            # if this is an insertion block, append block to the second string with the insert highlight
            formatted_b.append(f_str.format(color_insert, b[j1:j2]))
        elif tag == 'replace':
            # if this is a replacement block, append block to both strings with the replace highlight
            formatted_a.append(f_str.format(color_replace, a[i1:i2]))
            formatted_b.append(f_str.format(color_replace, b[j1:j2]))

    # return the formatted strings
    return ''.join(formatted_a), ''.join(formatted_b)
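
One caveat, stated as an assumption about your inputs: if the strings can contain characters such as < or &, the notebook will interpret them as HTML. A possible sketch of a variant that escapes each segment with html.escape before wrapping it is shown here. The name get_escaped_diff and the COLORS/SPAN constants are hypothetical and only used for this illustration.

```python
from difflib import SequenceMatcher
from html import escape

# same span template and colors as above, grouped by opcode tag
SPAN = '<span style="background: {};">{}</span>'
COLORS = {'delete': '#811612', 'insert': '#28862D', 'replace': '#BABA26'}

def get_escaped_diff(a, b):
    """Hypothetical variant of getFormattedDiff that HTML-escapes every segment."""
    out_a, out_b = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        # escape the raw text so characters like < and & are shown literally
        seg_a, seg_b = escape(a[i1:i2]), escape(b[j1:j2])
        if tag == 'equal':
            out_a.append(seg_a)
            out_b.append(seg_b)
        else:
            # delete: only a has text; insert: only b has text; replace: both do
            if seg_a:
                out_a.append(SPAN.format(COLORS[tag], seg_a))
            if seg_b:
                out_b.append(SPAN.format(COLORS[tag], seg_b))
    return ''.join(out_a), ''.join(out_b)
```

It returns the same kind of (formatted_a, formatted_b) pair, so it can be passed to display(HTML(...)) in the same way.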

Now we run the function defined above in a loop over all the pairs in cases, like so:

from IPython.display import HTML, display

# iterate through all the cases and display both strings with the highlights
for a, b in cases:
    formatted_a, formatted_b = getFormattedDiff(a, b)
    display(HTML(formatted_a))
    display(HTML(formatted_b))
    print()

and we get the following display output:

[image: screenshot of the highlighted output]

Burhan