
I have two files, fileA and fileB. I'd like to get the line numbers of all the lines in fileB that exist in fileA. However, I only count a line as "exists in fileA" if it is immediately followed in fileA by the same line that follows it in fileB. I've written the following code, which collects the numbers of the fileB lines that fail this test:

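def compare_two(fileA, fileB):
    # Read all of fileA into memory so that line pairs can be searched as substrings.
    with open(fileA, 'r') as fa:
        fa_content = fa.read()
    keep_line_num = []  # numbers of the lines in fileB that are not matched in fileA
    # The with blocks close both files, so no explicit close() calls are needed.
    with open(fileB, 'r') as fb:
        i = 1
        while True:
            line = fb.readline()
            if line == '':  # end of file (neither file contains blank lines)
                break
            last_pos = fb.tell()
            new_line = fb.readline()  # peek at the next line
            fb.seek(last_pos)  # rewind so the next iteration re-reads it
            theFollowing = line + new_line
            # Substring search over the whole of fileA: this is the slow step.
            if theFollowing not in fa_content:
                keep_line_num.append(i)
            i += 1
    return keep_line_num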

compare_two(fileA, fileB)

This works fine for small files, but I want to use it on files as large as 2 GB, and this method is too slow for that. Is there any other way to do this in Python 2.7?

joe wong
  • Try constructing a lookup dict that uses the lines in A as keys and their line numbers as values. Each lookup should then take `O(1)` time on average instead of `O(filesize)`. Not too sure though, so comment instead of answer. – T Tse Feb 27 '18 at 10:01
  • I am not sure how much better this can really get with python. Even if you were to use dictionaries, I still think it would be slow. You may be able to do something super fancy with multi-processing. If speed is what you are after though, it might be wise to try to write this in something like Golang or C++ – Ryan Feb 27 '18 at 10:16
  • Why not hash a small portion of every line, look for identical hash values, and then compare the full lines of those that match? It might be faster if the lines are usually fairly long – Shinra tensei Feb 27 '18 at 10:33
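
The lookup idea from the comments could be sketched roughly as follows. This is a minimal sketch, not a tested solution for 2 GB inputs: the names `load_pairs` and `compare_two_fast` are hypothetical, it assumes the set of fileA's line pairs fits in memory, and it compares whole lines rather than substrings, so it is slightly stricter than the code in the question.

```python
def load_pairs(path_a):
    """Collect every line of A and every pair of consecutive lines of A."""
    pairs = set()
    singles = set()
    prev = None
    with open(path_a, 'r') as fa:
        for raw in fa:
            line = raw.rstrip('\n')  # ignore a missing newline on the last line
            singles.add(line)
            if prev is not None:
                pairs.add((prev, line))
            prev = line
    return pairs, singles


def compare_two_fast(path_a, path_b):
    """Return 1-based numbers of lines in B whose (line, next line) pair is not in A."""
    pairs, singles = load_pairs(path_a)
    keep_line_num = []
    prev = None
    num = 0
    with open(path_b, 'r') as fb:
        for num, raw in enumerate(fb, 1):
            line = raw.rstrip('\n')
            # Judge the previous line now that its successor is known.
            if prev is not None and (prev, line) not in pairs:
                keep_line_num.append(num - 1)
            prev = line
    # The last line of B has no successor, so it only needs to occur somewhere in A.
    if prev is not None and prev not in singles:
        keep_line_num.append(num)
    return keep_line_num
```

Each file is read once and every membership test is a hash lookup, so the running time grows roughly linearly with the file sizes. If memory becomes the limit, storing a hash of each pair instead of the pair itself would shrink the set, at the cost of a small chance of false matches.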

1 Answer


Take a look at difflib, which is part of Python's standard library.

It can tell you where your files differ and where they are identical. See also "python difflib comparing files".
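
As a rough illustration of what difflib reports, here is a minimal sketch using `difflib.SequenceMatcher` on two lists of lines. The helper name `unmatched_b_lines` is hypothetical. Note two assumptions: difflib aligns the two sequences in order, which is not the same as the "line plus next line" rule in the question, and `SequenceMatcher` is quadratic in the worst case, so it may well be too slow for 2 GB inputs.

```python
import difflib


def unmatched_b_lines(a_lines, b_lines):
    """Return 1-based numbers of lines in b_lines that difflib could not align with a_lines."""
    # autojunk=False turns off the heuristic that ignores very frequent lines.
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        # block.b is the 0-based start in b_lines; block.size is the run length.
        for offset in range(block.size):
            matched.add(block.b + offset + 1)
    return [n for n in range(1, len(b_lines) + 1) if n not in matched]
```

For example, `unmatched_b_lines(['a', 'b', 'c', 'd'], ['a', 'b', 'x', 'c', 'd'])` flags only the inserted line `'x'`, whereas the pair rule from the question would also flag the line before it.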

Joe