Find similar expressions without knowing them in multiple texts with Python3

Question

In a folder of text files, I would like to extract all similar groups of words or similar expressions between files. I do not know the words in advance. The expressions string length are between 2 and n words.

The ideal output would be a dictionary with the frequency of the pattern and the list of files containing it.

Is it possible to do it with the difflib library ? If not, could you suggest one I can look for ?

Examples :

text_file1 = """Lorem ipsum dolor sit amet, connectum adipiscing elit. 
Sed do eisumodus temporis incididunt ut labore dororis magnum alida."""

text_file2 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""

text_file3 = """Lorem ipsum dolor sit amet, consectetor adipiscat elit, 
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""

output = {
"Lorem ipsum dolor sit amet":
    "freq" : 3,
    "files" : ['text_file1','text_file2','text_file3'],
"adipiscing elit":
    "freq" : 2,
    "files" : ['text_file1','text_file2'],
"sed do":
    "freq" : 2,
    "files" : ['text_file1','text_file2'],
"tempor incididunt":
    "freq" : 2,
    "files" : ['text_file2','text_file3'],
"incididunt ut labore":
    "freq" : 2,
    "files" : ['text_file1','text_file2'],
}

Do capitalization or punctuation marks such as ", . ! ?" matter? — Tzane, Nov 05 '21 at 12:07
How do you define 'similar'? Are you looking for identical substrings? Or can there be differences? If differences, can the words be different lengths? It can be a complex problem — Chris_Rands, Nov 05 '21 at 19:23
See also answers here Checkout answers here https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings — Chris_Rands, Nov 05 '21 at 19:23
How do you define what an expression is ? It seems to me there are endless possibilities in grouping words? — Tobias P. G., Nov 05 '21 at 20:42
@TobiasP.G. here: **an expression = at least 2 words separated by 1 space** (I'm looking in writing texts). — Thammas, Nov 06 '21 at 14:47
@Tzane Capitalization and punctuantion marks don't matter. ex: "The code" is similar to "the code" in this case. — Thammas, Nov 06 '21 at 14:50
@Tim Roberts How can I use grep if I don't know the search terms ? I'm looking a way to find if similar expressions exist in files and extract them if they exist. — Thammas, Nov 06 '21 at 14:56
@Chris_Rands I understand it can be a complex problem. Maybe I should define better the goal : looking for possible connections between files by finding common group of words (at least 2). I saw the answers in the question you suggest. I understand how to calculate the distance between 2 strings. Then my question would be : how to compare all the "2 words", all "3 words", all "n words"... in hundreds of text files ? Maybe should I edit my original question. — Thammas, Nov 06 '21 at 15:16
I admit that I misunderstood your goal. What you're basically after is, split the files into words, and run `diff` on the words, looking for common sequences. That's not enough, of course, but it's that philosophy. Now I understand why you asked about `difflib`. I'll ponder this. — Tim Roberts, Nov 06 '21 at 18:43

score 1 · Answer 1 · answered Nov 06 '21 at 18:58

This seems to solve the problem, using the algorithm I described in my comment. I'm ignoring 1 word matches. I've also removed punctuation and shifted to lower case; that's easy to change.

import difflib
import itertools

text_file1 = """Lorem ipsum dolor sit amet, connectum adipiscing elit. 
Sed do eisumodus temporis incididunt ut labore dororis magnum alida."""

text_file2 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""

text_file3 = """Lorem ipsum dolor sit amet, consectetor adipiscat elit, 
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""

def preprocess(s):
    return s.lower().replace('.','').replace(',','').split()

targets = [
    ('text_file1', preprocess(text_file1) ),
    ('text_file2', preprocess(text_file2) ),
    ('text_file3', preprocess(text_file3) ),
]

output = {}


for combo in itertools.combinations( targets, 2 ):
    matcher = difflib.SequenceMatcher( a=combo[0][1], b=combo[1][1] )
    for m in matcher.get_matching_blocks():
        if m.size > 1:
            match = ' '.join( combo[0][1][m.a:m.a+m.size] )
            if match not in output:
                output[match] = {'freq':0, 'files':set() }
            output[match]['files'].add( combo[0][0] )
            output[match]['files'].add( combo[1][0] )

for k in output:
    output[k]['freq'] = len(output[k]['files'])

from pprint import pprint
pprint(output)

Output:

{'adipiscing elit sed do': {'files': {'text_file2', 'text_file1'}, 'freq': 2},
 'incididunt ut labore': {'files': {'text_file2', 'text_file1'}, 'freq': 2},
 'lorem ipsum dolor sit amet': {'files': {'text_file1',
                                          'text_file2',
                                          'text_file3'},
                                'freq': 3},
 'tempor incididunt': {'files': {'text_file2', 'text_file3'}, 'freq': 2}}

score 1 · Accepted Answer · answered Nov 07 '21 at 03:59

This is an interesting problem to work with.

First of all, let's import all the required modules.

import difflib
import itertools
import re

Data cleaning

Then, we will perform some data cleaning, or preprocessing, with a lambda expression. The cleaning process will cover the following steps.

Remove all non-letter characters
Replace all kinds of space like newline \n and tab \t with a single space
Lowercase the text files and split them into words

a = """Lorem ipsum dolor sit amet, connectum adipiscing elit. 
Sed do eisumodus temporis incididunt ut labore dororis magnum alida."""

b = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""

c = """Lorem ipsum dolor sit amet, consectetor adipiscat elit, 
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""

f = lambda x: re.sub(r"[^a-z ]", "", re.sub("\s+", " ", x.lower())).split()
a, b, c = map(f, [a, b, c])

print(a, b, c, sep="\n\n")

Changing the data structure

Before we begin, the current data structure is not fine to work with because the text files cannot be named and it lacks scalability. What if you need to work with four, five, or more files? We will use a dictionary to store the text files from now on.

data = {
    "text_a": """Lorem ipsum dolor sit amet, connectum adipiscing elit. 
Sed do eisumodus temporis incididunt ut labore dororis magnum alida.""",

    "text_b": """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.""",

    "text_c": """Lorem ipsum dolor sit amet, consectetor adipiscat elit, 
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""
}

f = lambda x: re.sub(r"[^a-z ]", "", re.sub("\s+", " ", x.lower())).split()
data = {i: f(v) for i, v in data.items()}

print(data)

Sequence matching

Next, we will use itertools.combinations to compare two files at a time. (m, a) is the first key-value pair, for example ("text_a", ["lorem", "ipsum", ...]), and (n, b) is the second.

for (m, a), (n, b) in itertools.combinations(data.items(), 2):

After that, difflib will find the matching blocks of words. Blocks with size of fewer than two words are not considered expressions, and therefore, will be skipped.

output = {}
for (m, a), (n, b) in itertools.combinations(data.items(), 2):
    matcher = difflib.SequenceMatcher(None, a, b)

    for i1, _, size in matcher.get_matching_blocks():
        seq = " ".join(a[i1:i1 + size])

        if size < 2:
            continue

        if seq not in output:
            output[seq] = {"freq": 1, "files": set()}

        freq, files = output[seq]["freq"], output[seq]["files"]

        output[seq]["files"] = output[seq]["files"].union({m}, {n})
        output[seq]["freq"] = len(output[seq]["files"])

print(output)

The complete code

Here is the complete and working code.

import difflib
import itertools
import re

data = {
    "text_a": """Lorem ipsum dolor sit amet, connectum adipiscing elit. 
Sed do eisumodus temporis incididunt ut labore dororis magnum alida.""",

    "text_b": """Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.""",

    "text_c": """Lorem ipsum dolor sit amet, consectetor adipiscat elit, 
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""
}

f = lambda x: re.sub(r"[^a-z ]", "", re.sub("\s+", " ", x.lower())).split()
data = {i: f(v) for i, v in data.items()}

output = {}
for (m, a), (n, b) in itertools.combinations(data.items(), 2):
    matcher = difflib.SequenceMatcher(None, a, b)

    for i1, _, size in matcher.get_matching_blocks():
        seq = " ".join(a[i1:i1 + size])

        if size < 2:
            continue

        if seq not in output:
            output[seq] = {"freq": 1, "files": set()}

        freq, files = output[seq]["freq"], output[seq]["files"]

        output[seq]["files"] = output[seq]["files"].union({m}, {n})
        output[seq]["freq"] = len(output[seq]["files"])

print(output)

Output

This gives the following output.

{
    'lorem ipsum dolor sit amet': {
        'freq': 3, 'files': {'text_b', 'text_a', 'text_c'}
    },
    'adipiscing elit sed do': {
        'freq': 2, 'files': {'text_b', 'text_a'}
    }, 
    'incididunt ut labore': {
        'freq': 2, 'files': {'text_b', 'text_a'}
    },
    'tempor incididunt': {
        'freq': 2, 'files': {'text_b', 'text_c'}
    }
}

Tested with 5000 files, work as expected ! I just change the pattern to match accents. — Thammas, Nov 07 '21 at 19:16