0

I am reading Python Object-oriented programming by Steven and Dusty.

I am at Chapter 10, which is all about the Iterator design pattern. At the end of the chapter there is an exercise to write a generator function/expression to find the common lines between two files.

Here are my naive approaches:

import os

old = os.path.normpath('E:/testing/old.txt')
new = os.path.normpath('E:/testing/new.txt')
res = []

def func(source):
    # Yield the file's lines one at a time.
    for line in source.readlines():
        yield line

Approach 1:

with open(old, "r") as f:
    for line in func(f):
        with open(new, "r") as _f:
            for _line in func(_f):
                if line == _line:
                    res.append(line)

Approach 2:

with open(old, "r") as f:
    for line in func(f):
        with open(new, "r") as _f:
            res.extend(filter(lambda a: line == a, func(_f)))

Context:

Now, let's think about a totally different problem: assume we have two lists of strings (with n and m elements) instead of files. We can find the common strings in O(n) time with O(m) space.
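
For reference, a minimal sketch of that list-based idea (the lists `old_strings` and `new_strings` are made up for illustration; note that building the set itself costs O(m) time):

old_strings = ["apple", "banana", "cherry"]   # m elements
new_strings = ["banana", "cherry", "date"]    # n elements

old_set = set(old_strings)                          # O(m) space
common = [s for s in new_strings if s in old_set]   # O(n) average time
print(common)  # ['banana', 'cherry']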

Big questions

  1. Is there a better, more Pythonic way to refactor the code above?
  2. I know generators are all about saving space and state, but following the above context, is there a better way in terms of time complexity to compare two files and find the common lines?

My end goal is to optimize my code (if possible) in terms of time and space complexity and to make it more Pythonic.

P.S. - Can someone point me to the source code of the Linux diff command and of git diff?

I tried finding the source code of Linux diff and git-diff but was unable to.

  • 1
    File-like objects are already iterable: `for line in source:` would work just as well. – chepner May 05 '23 at 12:59
  • 1
    you don't need to open the file again and again; you can open both files with a single `with` statement (see the sketch after these comments) – sahasrara62 May 05 '23 at 13:08
  • 2
    *"optimize in terms of time and space complexity"* - bear in mind that optimization is often a tradeoff. The code in your question is very space efficient (the working dataset is only one line of each file), but not very time efficient (if the "old" file has N1 lines, then you're reopening the "new" file N1 times and reading every one of its lines). – slothrop May 05 '23 at 13:15
  • Indeed, I think the time complexity is O(nm), not O(n). – Swifty May 05 '23 at 13:16
  • 2
    [GNU diff](https://git.savannah.gnu.org/cgit/). [Git diff](https://github.com/git/git/blob/master/diff.h). Note that diff is a program that solves [LCS](https://en.wikipedia.org/wiki/Longest_common_subsequence), so it is slightly different from the task of your code. – ken May 05 '23 at 13:28
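
Putting the first two comments together, a minimal sketch of Approach 1 that iterates the file objects directly and opens both files in a single `with` statement (the time complexity is still O(n*m); `old` and `new` are the paths from the question):

res = []
with open(old) as f, open(new) as _f:
    for line in f:        # file objects are already iterable
        _f.seek(0)        # rewind instead of reopening the file
        for _line in _f:
            if line == _line:
                res.append(line)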

4 Answers

3

I don't believe this problem is best solved using generators. My approach would be:

old = 'E:/testing/old.txt'
new = 'E:/testing/new.txt'

with open(old) as f:
    old_lines = set(f) # A set of lines

with open(new) as f:
    matches = [line for line in f if line in old_lines]

print(matches)

If you want to compare lines stripped of the trailing newline character (since the last line may not have a trailing newline), then you could use the following code, which also demonstrates creating two sets and computing their intersection. That way you do not have to explicitly loop through the lines. But by creating two sets and computing their intersection you lose control over the order in which matches will be printed. It will also be less memory efficient and will not run any faster (perhaps even a bit more slowly) than the first solution. Finally, if either file contains duplicated lines, you will compute a different, smaller set of matches.

old = 'E:/testing/old.txt'
new = 'E:/testing/new.txt'

with open(old) as f:
    old_lines = set(line.strip('\n') for line in f) # A set of lines

with open(new) as f:
    matches = set(line.strip('\n') for line in f) & old_lines

print(matches)
Booboo
  • 1
    @slothrop Wow! Thanks, I had a brain malfunction! My brain was thinking `&`, which I also mentally pronounce as "and", and so that is what I ended up unfortunately writing. – Booboo May 05 '23 at 14:24
  • Okay, so I see, we have to bring at least one file into memory. One follow-up question: a generator will not be space efficient if we process both files in one go (assuming the files are gigabytes in size or more), because it will not reclaim the space of previously read lines. Generators are efficient because they can save state, and we can halt the task in case of any overflow (hypothetical condition) and resume it at a later point in time? – commonSense May 06 '23 at 02:06
  • And realizing this fact, we cannot do any better than `O(n*m)`, right? – commonSense May 06 '23 at 02:08
  • Just to clarify my answer: my second coding example combines two ideas, (1) comparing lines that have been stripped of their trailing newlines and (2) creating two sets and then computing the intersection. Creating two sets will be less memory efficient and will not save any time compared with the first approach; I just wanted to show another way of thinking about the problem. But if one file has `n` lines and the other `m`, then the time complexity should be `O(n + m)`, not `O(n * m)`. I am not sure what you mean by "we can halt the task in between in case of any overflow". – Booboo May 06 '23 at 09:55
  • 1
    @commonSense remember that a file object in Python already behaves quite similarly to a generator, even without using the `func` in your question. You can call `f.readline()` to get one line at a time (and this "saves state" in the sense that the next call to `.readline()` will get the next line - this behaviour is like `next` with a generator), and you can iterate `for line in f:` which loads only one line into memory at a time, just like iterating over a generator which yields one item at a time. – slothrop May 06 '23 at 10:30
  • @Booboo What I was trying to say is: suppose we have 2 GB of RAM and our files are larger than 2 GB (say 4 GB); then we cannot bring the whole file into memory. In that case we cannot create a set or use any other in-memory data structure to solve the problem, and that is why we cannot do any better than `O(n*m)`. – commonSense May 06 '23 at 19:03
  • And the point about _halting the task_ was trivial and can be ignored, but my thought process was: suppose this program is the only process running on the computer, and it keeps bringing in one line after another without reclaiming the previous memory; eventually the memory will overflow. In that case, if we are using generators, we can save the state of where the processing of the file was halted, reclaim the memory, and restart the process from the point where we left off. – commonSense May 06 '23 at 19:03
  • No: if you cannot bring the entire source files into memory, first do an external sort of the files using temporary disk storage (e.g. the merge sort algorithm). These generally run in `O(n * log(n))`. Then I would read in both files concurrently, one line at a time, as if you were doing a merge operation; then it should be easy to spot duplicates (sketched below). As far as the merge goes, comparing two keys takes constant time and reading each file is `O(n)`. In the end, the time for sorting each file by the sort-merge method and doing a final merge should be roughly `O(n * log(n)) + O(m * log(m))`. – Booboo May 06 '23 at 19:43
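
A minimal sketch of the merge step described in the comment above, assuming both files have already been externally sorted line by line (the function and parameter names are made up for illustration):

def common_sorted_lines(sorted_old_path, sorted_new_path):
    # Both files must be sorted; only one line of each is held in memory at a time.
    with open(sorted_old_path) as f1, open(sorted_new_path) as f2:
        a, b = f1.readline(), f2.readline()
        while a and b:                      # '' signals end of file
            if a == b:
                yield a
                a, b = f1.readline(), f2.readline()
            elif a < b:
                a = f1.readline()
            else:
                b = f2.readline()
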
1

As far as I know, there is no Python-specific approach. However, as pointed out in other answers, `set.intersection` is an efficient way to do it, so chunked loading is also worth a try.

from typing import TextIO

def chunked_reader(fo: TextIO, chunk_size: int):
    while True:
        lines = fo.readlines(chunk_size)
        if not lines:
            break
        # Uncomment this if the trailing newline is an issue.
        # lines = [l.strip("\n") for l in lines]
        yield lines


def intersect_files(path1: str, path2: str, chunk_size: int):
    results = []
    with open(path1) as f1, open(path2) as f2:
        for l1 in chunked_reader(f1, chunk_size):
            f2.seek(0)
            l1 = set(l1)
            for l2 in chunked_reader(f2, chunk_size):
                results.extend(l1.intersection(set(l2)))
    return results

`intersect_files` reads up to `chunk_size` bytes at a time from both files and compares them. The smaller the `chunk_size`, the less memory is used; the larger the `chunk_size`, the fewer repeated file reads. If `chunk_size` is larger than the file, both files are read only once.

However, note that the order of the results will change depending on the `chunk_size`.
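
A hypothetical call (the paths are from the question; the chunk size of roughly 1 MiB is just illustrative):

common = intersect_files('E:/testing/old.txt', 'E:/testing/new.txt', chunk_size=1 << 20)
print(common)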

ken
  • Thanks for this! I think the chunk size will do the optimization at the hardware level, but the overall algorithmic complexity will still be `O(n*m)`. – commonSense May 06 '23 at 02:14
  • In the worst case, yes. But that is the case when all lines have the same hash but different values. Obviously, this is an extremely rare condition and will not happen unless it is done intentionally. The average time complexity is `O(max(n, m))` when the chunk size is large enough. – ken May 06 '23 at 03:36
1

Finding common lines is not exactly the same problem as comparing for differences. Comparing for differences would have to take the order of lines into account, whereas finding common lines is merely a set membership problem.

You can actually use a set to implement it:

with open(old) as f:
    oldLines = set(f)  # O(m) space

with open(new) as f:
    # Note: the generator is lazy, so consume `common` before this file is closed.
    common = (line for line in f if line in oldLines)  # O(n) time

If you need to count repetitions in the old and new as distinct occurrences, then use the Counter class from collections instead of a set.

from collections import Counter

with open(old) as f:
    oldLines = Counter(f)  # O(m) space, with per-line counts

with open(new) as f:
    # Counter.subtract() returns None, so `... or line` yields the line itself
    # while decrementing its remaining count in oldLines.
    common = (oldLines.subtract([line]) or line for line in f
              if oldLines[line] > 0)
Alain T.
  • I wasn't actually trying to solve the problem of finding common or different lines, but rather trying to understand how generators can actually be used to solve those problems. – commonSense May 06 '23 at 02:19
0

Edit: This answer is suboptimal in terms of space complexity; see Booboo's answer.

If I were you, I would use a set; sets can do membership testing in (almost) constant time, and `some_set.intersection(some_other_set)` yields their common members. It's really fast because sets use hash tables.
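
A minimal sketch of that idea, reusing the `old` and `new` paths from the question (both files end up fully in memory, as noted in the comment below):

with open(old) as f_old, open(new) as f_new:
    common = set(f_old).intersection(set(f_new))  # hash-based membership tests
print(common)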

Brock Brown
  • While this *may* be faster than OP's "double loop" approach, it does have the [possible] disadvantage that the entire contents of both files would need to be in memory concurrently – DarkKnight May 05 '23 at 13:27
  • @DarkKnight Good point. Booboo's answer solves this pretty well, retaining both `O(n)` time and `O(n)` space complexity. – Brock Brown May 05 '23 at 13:34