
I am trying to compare two text files of about 1 MB each in Python using difflib's SequenceMatcher. At this input size it scales very poorly: the last time I ran it, the comparison took about 7 minutes.

Is there a more efficient way to do this in Python, without using hashing, that will still give the percentage or ratio of similarity between the two files?

This is my existing code:

from difflib import SequenceMatcher

# Read the two file names (without the .txt extension) from standard input.
f1 = input()
f2 = input()
text1 = open("./text-files/" + f1 + ".txt").read()
text2 = open("./text-files/" + f2 + ".txt").read()
m = SequenceMatcher(None, text1, text2)
print(m.ratio())
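
For reference, difflib documents two cheaper options: `SequenceMatcher` accepts any sequences, so comparing lists of lines instead of raw characters keeps the sequences short, and `quick_ratio()` returns a fast upper bound on `ratio()`. A minimal sketch of that variant (the file names here are placeholders):

from difflib import SequenceMatcher

# Comparing line lists instead of megabyte-long character strings keeps
# the sequences short, which is where SequenceMatcher's cost explodes.
with open("./text-files/a.txt") as fh1, open("./text-files/b.txt") as fh2:
    lines1 = fh1.readlines()
    lines2 = fh2.readlines()

m = SequenceMatcher(None, lines1, lines2)
print(m.quick_ratio())  # cheap upper bound on ratio()
print(m.ratio())        # exact, but line-level, similarity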

Thanks

  • How about calculating document vectors for both files and then checking the distance between them? We use it at school to check whether students cheated. Google also uses something like that, I think, to find page similarities (see the document-vector sketch after these comments). – Sedy Vlk Oct 26 '17 at 19:43
  • The `SequenceMatcher` class is defined in `difflib`'s [source file](https://github.com/python/cpython/blob/3.6/Lib/difflib.py). Since the source is available, you can profile it to find the most important place(s) to spend _your_ time optimizing the code (i.e. where it spends most of its execution time). Profiling is fairly easy; see [**How can you profile a script?**](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) (a minimal profiling sketch follows below). – martineau Oct 26 '17 at 20:15
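
A minimal sketch of the document-vector idea from the first comment, assuming plain bag-of-words counts and cosine similarity (the file names are placeholders):

from collections import Counter
import math

def cosine_similarity(text1, text2):
    # Bag-of-words vectors: each word maps to its frequency in the text.
    v1 = Counter(text1.split())
    v2 = Counter(text2.split())
    # Dot product over the words the two texts share.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

text1 = open("./text-files/a.txt").read()
text2 = open("./text-files/b.txt").read()
print(cosine_similarity(text1, text2))  # 1.0 means identical word frequencies

This runs in time linear in the file sizes, but note that it measures word-frequency overlap rather than edit similarity, so it is not a drop-in replacement for `ratio()`.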
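And a minimal profiling sketch in the spirit of the second comment, using the standard-library `cProfile` and `pstats` modules (again with placeholder file names):

import cProfile
import pstats
from difflib import SequenceMatcher

text1 = open("./text-files/a.txt").read()
text2 = open("./text-files/b.txt").read()

# Profile one comparison and print the ten most expensive calls by
# cumulative time, to see where SequenceMatcher spends its time.
profiler = cProfile.Profile()
profiler.enable()
SequenceMatcher(None, text1, text2).ratio()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)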
