This is an interesting problem to work with.
First of all, let's import all the required modules.
import difflib
import itertools
import re
Data cleaning
Then we will clean (preprocess) the text with a lambda expression. The cleaning covers the following steps.
- Lowercase the text
- Replace every run of whitespace, such as newline \n and tab \t, with a single space
- Remove all characters other than letters and spaces
- Split the text into words
a = """Lorem ipsum dolor sit amet, connectum adipiscing elit.
Sed do eisumodus temporis incididunt ut labore dororis magnum alida."""
b = """Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""
c = """Lorem ipsum dolor sit amet, consectetor adipiscat elit,
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""
f = lambda x: re.sub(r"[^a-z ]", "", re.sub(r"\s+", " ", x.lower())).split()
a, b, c = map(f, [a, b, c])
print(a, b, c, sep="\n\n")
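As a small sketch of what the cleaning steps do to a messy string, here is the same logic written as a named function. The name clean and the sample string are made up for illustration. Note that a hyphenated word is fused into one token, because the hyphen is removed rather than turned into a space.

```python
import re

def clean(text):
    """Lowercase, collapse whitespace, drop non-letters, split into words."""
    text = re.sub(r"\s+", " ", text.lower())   # newlines, tabs, runs of spaces -> one space
    text = re.sub(r"[^a-z ]", "", text)        # keep only lowercase letters and spaces
    return text.split()

# Hypothetical sample input, not taken from the texts above
sample = "Lorem  ipsum,\n\tDOLOR sit-amet 42!"
words = clean(sample)
print(words)  # ['lorem', 'ipsum', 'dolor', 'sitamet']
```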
Changing the data structure
Before we begin, note that the current data structure is awkward to work with: the texts live in separate variables, so they cannot be labeled, and the approach does not scale. What if you need to work with four, five, or more files? From now on we will store the texts in a dictionary keyed by name.
data = {
    "text_a": """Lorem ipsum dolor sit amet, connectum adipiscing elit.
Sed do eisumodus temporis incididunt ut labore dororis magnum alida.""",
    "text_b": """Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.""",
    "text_c": """Lorem ipsum dolor sit amet, consectetor adipiscat elit,
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""
}
f = lambda x: re.sub(r"[^a-z ]", "", re.sub(r"\s+", " ", x.lower())).split()
data = {i: f(v) for i, v in data.items()}
print(data)
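If the texts really are files on disk, the dictionary could be built from a folder instead of being typed in. The following is only a sketch under assumptions: the helper name load_texts, the .txt extension, and the use of the file stem as the key are all my own choices, and the temporary folder exists only to make the example self-contained.

```python
import re
import tempfile
from pathlib import Path

def clean(text):
    """Same cleaning steps as the lambda above, as a named function."""
    return re.sub(r"[^a-z ]", "", re.sub(r"\s+", " ", text.lower())).split()

def load_texts(folder):
    """Hypothetical helper: map each *.txt file stem to its cleaned word list."""
    return {
        path.stem: clean(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.txt"))
    }

# Demo with throwaway files; in practice, point load_texts at your own folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "text_a.txt").write_text("Lorem ipsum, dolor.", encoding="utf-8")
    Path(tmp, "text_b.txt").write_text("Sed do\neiusmod", encoding="utf-8")
    data = load_texts(tmp)

print(data)  # {'text_a': ['lorem', 'ipsum', 'dolor'], 'text_b': ['sed', 'do', 'eiusmod']}
```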
Sequence matching
Next, we will use itertools.combinations to compare two files at a time. (m, a) is the first key-value pair, for example ("text_a", ["lorem", "ipsum", ...]), and (n, b) is the second.
for (m, a), (n, b) in itertools.combinations(data.items(), 2):
    ...  # compare word lists a and b here
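To see which pairs this loop visits, here is a minimal sketch with a toy dictionary (the one-word lists are placeholders). Each unordered pair appears exactly once, so n files give n * (n - 1) / 2 comparisons.

```python
import itertools

# Toy stand-in for the cleaned data dictionary
data = {"text_a": ["lorem"], "text_b": ["ipsum"], "text_c": ["dolor"]}

# Collect only the names of each pair to show the iteration order
pairs = [(m, n) for (m, a), (n, b) in itertools.combinations(data.items(), 2)]
print(pairs)  # [('text_a', 'text_b'), ('text_a', 'text_c'), ('text_b', 'text_c')]
```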
After that, difflib will find the matching blocks of words. Blocks of fewer than two words are not considered expressions and are therefore skipped.
output = {}
for (m, a), (n, b) in itertools.combinations(data.items(), 2):
    matcher = difflib.SequenceMatcher(None, a, b)
    for i1, _, size in matcher.get_matching_blocks():
        # Skip single words and the zero-length block that ends the list
        if size < 2:
            continue
        seq = " ".join(a[i1:i1 + size])
        # Record both files of the pair; freq is the number of distinct files
        entry = output.setdefault(seq, {"freq": 0, "files": set()})
        entry["files"].update({m, n})
        entry["freq"] = len(entry["files"])
print(output)
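It may help to look at what get_matching_blocks returns on a tiny input. In this sketch (the two word lists are made up), each block is a triple (start in a, start in b, size), and the last block is always a dummy entry of size 0, which is one more reason the size check above is needed.

```python
import difflib

# Two short, made-up word lists that differ in the middle word
a = "lorem ipsum dolor sit amet".split()
b = "lorem ipsum magna sit amet".split()

matcher = difflib.SequenceMatcher(None, a, b)
blocks = [tuple(block) for block in matcher.get_matching_blocks()]
print(blocks)  # [(0, 0, 2), (3, 3, 2), (5, 5, 0)]

# Turn the real blocks (size >= 2) back into expressions
expressions = [" ".join(a[i:i + size]) for i, _, size in blocks if size >= 2]
print(expressions)  # ['lorem ipsum', 'sit amet']
```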
The complete code
Here is the complete and working code.
import difflib
import itertools
import re
data = {
    "text_a": """Lorem ipsum dolor sit amet, connectum adipiscing elit.
Sed do eisumodus temporis incididunt ut labore dororis magnum alida.""",
    "text_b": """Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.""",
    "text_c": """Lorem ipsum dolor sit amet, consectetor adipiscat elit,
sid du eisumodos tempor incididunt at laboris et dolor magnum aliquos."""
}

# Lowercase, collapse whitespace, drop non-letters, split into words
f = lambda x: re.sub(r"[^a-z ]", "", re.sub(r"\s+", " ", x.lower())).split()
data = {i: f(v) for i, v in data.items()}

output = {}
for (m, a), (n, b) in itertools.combinations(data.items(), 2):
    matcher = difflib.SequenceMatcher(None, a, b)
    for i1, _, size in matcher.get_matching_blocks():
        # Skip single words and the zero-length block that ends the list
        if size < 2:
            continue
        seq = " ".join(a[i1:i1 + size])
        # Record both files of the pair; freq is the number of distinct files
        entry = output.setdefault(seq, {"freq": 0, "files": set()})
        entry["files"].update({m, n})
        entry["freq"] = len(entry["files"])
print(output)
Output
This gives the following output.
{
    'lorem ipsum dolor sit amet': {
        'freq': 3, 'files': {'text_b', 'text_a', 'text_c'}
    },
    'adipiscing elit sed do': {
        'freq': 2, 'files': {'text_b', 'text_a'}
    },
    'incididunt ut labore': {
        'freq': 2, 'files': {'text_b', 'text_a'}
    },
    'tempor incididunt': {
        'freq': 2, 'files': {'text_b', 'text_c'}
    }
}
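Since sets are unordered, the order of the file names may differ between runs. If you want a stable, readable report, one possible approach is to sort the expressions by frequency and then by length in words. This is only a sketch: the output dictionary is copied from the result above, and the sort key is my own choice.

```python
# Result copied from the output above
output = {
    "lorem ipsum dolor sit amet": {"freq": 3, "files": {"text_a", "text_b", "text_c"}},
    "adipiscing elit sed do": {"freq": 2, "files": {"text_a", "text_b"}},
    "incididunt ut labore": {"freq": 2, "files": {"text_a", "text_b"}},
    "tempor incididunt": {"freq": 2, "files": {"text_b", "text_c"}},
}

# Most frequent first; ties broken by longer expressions first
ranked = sorted(
    output.items(),
    key=lambda kv: (-kv[1]["freq"], -len(kv[0].split())),
)

for seq, info in ranked:
    # sorted() makes the file list print in a stable order
    print(info["freq"], seq, sorted(info["files"]))
```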