How to diff the two files using Python Generator

Question

I have one file of 100GB having 1 to 1000000000000 separated by new line. In this some lines are missing like 5, 11, 19919 etc. My Ram size is 8GB.

How to find the missing elements.

My idea take another file for i in range(1,1000000000000) read the lines one by one using the generator. can we use yield statement for this

Can help in writing the code

My Code, the below code taking as a list in does the below code can use it for production.?

def difference(a,b):
    with open(a,'r') as f:
        aunique=set(f.readlines())


    with open(b,'r') as f:
        bunique=set(f.readlines())

    with open('c','a+') as f:
        for line in list(bunique - aunique):
            f.write(line)

Hi, we're not here to write code for you. Please show your attempt, and we can help you fixing it! — tituszban, Aug 21 '19 at 12:07
If you really need to use python, search for `difflib`, if not, just use diff. — Joao Vitorino, Aug 21 '19 at 12:13
Possible duplicate of [Compare two files report difference in python](https://stackoverflow.com/questions/19120489/compare-two-files-report-difference-in-python) — techkuz, Aug 21 '19 at 12:25
A solution using `difflib`: [diff two big files in Python](https://stackoverflow.com/questions/4899146/diff-two-big-files-in-python) — hoefling, Aug 30 '19 at 00:58

ilmiacs · Answer 1 · 2019-08-28T18:43:31.207

5

If the values are in sequential order, you can simply note the previous value and see if the difference equals one:

prev = 0
with open('numbers.txt','r') as f:
    for line in f:
        value = int(line.strip())
        for i in range(prev, value-1):
            print('missing:', i+1)
    prev = value
# output numbers that are missing at the end of the file (see comment by @blhsing)
for i in range(prev, 1000000000000):
    print('missing:', i+1)

This should work fine in python3, as readlines is an iterator so will not load the full file at once or keep it in memory.

edited Aug 28 '19 at 18:43

answered Aug 21 '19 at 12:32

ilmiacs

2,566
15
20

The `if value - prev > 1:` statement is redundant. Also, using the `readlines` method unnecessarily stores all the lines into memory when iterating over the file object as a generator will do. – blhsing Aug 28 '19 at 18:10
Also, it will not output numbers that are missing at the end of the file. – blhsing Aug 28 '19 at 18:17
1

@blhsing Thanks for your fantastic comments. In fact I was not aware that `readlines()` parses the whole file and returns a list. I am shocked (I was convinced this changed in python3). The docs say to just use `for line in f`. I am going to update my answer accordingly. – ilmiacs Aug 28 '19 at 18:39
1

Yes that is why we print i+1 – ilmiacs Aug 28 '19 at 18:50
While this answer solves the problem, I'm holding back from upvoting it since the line `print('missing:', i+1)` appearing twice violates the DRY ([Don't Repeat Yourself](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)) principle, so the solution is not as clean as it can be. – blhsing Aug 28 '19 at 18:57
@blhsing While you are right that we have one line repeated, I would argue that the alternatives in this particular case might end up with somewhat convoluted and less readable code, or introduce superficial definitions. Your interpretation of the DRY principle is too strict for my personal taste. Also regarding your comment to the redundant statement `if value - prev > 1:` (which I removed) one could argue that keeping it may improve readability by explicitness (and is potentially faster by avoiding the call to `range()` on every iteration (I did not test this, though)). – ilmiacs Aug 28 '19 at 19:42
1

No problem. What I said is certainly a matter of taste. In cases like this I would prefer letting the generator that produces the complete sequence drive the loop so that the DRY principle can be adhered to, as you can see from my answer. Either approach works though. – blhsing Aug 28 '19 at 19:49

blhsing · Accepted Answer · 2019-08-28T19:08:25.813

4

You can iterate over all the numbers generated by range and keep comparing the number to the next number in the file. Output the number if it's missing, or read the next number for the next match:

with open('numbers') as f:
    next_number = 0
    for n in range(1000000000001):
        if n == next_number:
            next_number = int(next(f, 0))
        else:
            print(n)

Demo (assuming target numbers from 1 to 10 instead): https://repl.it/repls/WaterloggedUntimelyCoding

edited Aug 28 '19 at 19:08

answered Aug 28 '19 at 18:52

blhsing

91,368
6
71
106

1

Very elegant solution. – ilmiacs Aug 28 '19 at 19:58

GZ0 · Answer 3 · 2019-09-03T15:53:45.727

Assume the numbers in the file are already sorted, this is an improved version of @ilmiacs's solution.

def find_missing(f, line_number_ub):
    missing = []
    next_expected = 1
    for i in map(int, f):
        # The logic is correct without if, but adding it can greatly boost the 
        # performance especially when the percentage of missing numbers is small
        if next_expected < i:
            missing += range(next_expected, i)
        next_expected = i + 1
    missing += range(next_expected, line_number_ub)
    return missing

with open(path,'r') as f:
    print(*find_missing(f, 10**12), sep='\n')

If a generator is preferred over a list, you can do

def find_missing_gen(f, line_number_ub):
    missing = []
    next_expected = 1
    for i in map(int, f):
        if next_expected < i:
            yield from range(next_expected, i)
        next_expected = i + 1
    yield from range(next_expected, line_number_ub)

with open(path,'r') as f:
    print(*find_missing_gen(f, 10**12), sep='\n')

And following is some performance measurement using a list of strings from 1 to 9999 with 100 missing values (randomly selected):

(find_missing) 2.35 ms ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing w/o if) 4.67 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(@blhsing's solution) 3.54 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing_gen) 2.35 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing_gen w/o if) 4.42 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You may do some prelimary tests on your machine to see the performance of handling 1GB files in order to estimate whether the performance of handling 100GB files reaches your requirement. If not, you could consider further optimizations such as reading the file in blocks and using more advanced algorithms to find the missing numbers.

TornaxO7 · Answer 4 · 2019-09-02T11:12:10.420

Here is my "solution-code":

with open("yourPath.txt", "r") as file:
    distance = 0

    for index, element in enumerate(file):
        element = int(element)  # You don't need it, if your numbers are already intengers
        index += distance
        if index != element:
            distance += element - index
            [print(f"{index + missingNumbers} is missing!") for missingNumbers in range(0, element - index)]

(Short) Explanation
Example case:
Let's say you have this list: [1, 2, 3, 5, 6, 9]
The if-clause is going to be "activated" if it reaches 5!
At this moment element will be 5 and index will be 4, because index is the number which should be at this index. As you can see index is one number lower than element. As a result index has to be after that always 1 number higher, because index is going to be always 1 number lower than before. And if element is higher than index again ( in this example 9), distance becomes the distance between index and element and all numbers between 6 and 9 are going to printed out.

Note:
My english isn't the best... so feel free to edit it :)

How to diff the two files using Python Generator

4 Answers4

Linked