17

There are two files, say FileA and FileB, and we need to find all the numbers that are in FileA but not in FileB. All the numbers in FileA are sorted and all the numbers in FileB are sorted. For example,

Input:

FileA = [1, 2, 3, 4, 5, ...]
FileB = [1, 3, 4, 6, ...]

Output:

[2, 5, ...]

Memory is very limited: not even one entire file can be loaded into memory at a time. Linear or better time complexity is also needed.

So if the files were small enough to fit in memory, we could load them, initialize their contents as two sets, and take a set difference, solving the problem in O(n) or linear time:

set(contents_of_FileA) - set(contents_of_FileB)
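
For illustration, a minimal runnable version of that approach, assuming line-based files (the file names are placeholders):

# read both files fully - only workable when they fit in memory
with open("FileA") as fa, open("FileB") as fb:
    missing = sorted(set(map(int, fa)) - set(map(int, fb)))
print(missing)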

But since the files are so big, they cannot be loaded entirely into memory, so this is not possible.

Another approach would be a brute force method with batch processing: load a chunk of data from FileA, then compare it against FileB one chunk at a time, and once the FileA chunk has been checked against all of FileB, load the next FileA chunk, and so on. But this has O(n^2) or quadratic time complexity, which is not efficient for very large files with many entries.
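
For illustration, a rough sketch of that batch approach (file names and batch size are hypothetical). The full rescan of FileB for every chunk of FileA is what makes it quadratic:

def read_chunk(f, size=1000):  # hypothetical batch size
    # read up to `size` numbers from an already-open file
    chunk = []
    for line in f:
        chunk.append(int(line))
        if len(chunk) == size:
            break
    return chunk

missing = []
with open("FileA") as fa:
    while True:
        chunk_a = read_chunk(fa)
        if not chunk_a:
            break
        seen = set()
        with open("FileB") as fb:  # FileB is reopened and rescanned per chunk
            while True:
                chunk_b = read_chunk(fb)
                if not chunk_b:
                    break
                seen.update(set(chunk_a) & set(chunk_b))
        missing.extend(v for v in chunk_a if v not in seen)
print(missing)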

The problem is required to be solved in linear or lesser time complexity and without loading the entire files into memory. Any help?

krxat
  • Are the files already in sorted order? – Chris Doyle Jul 31 '19 at 09:24
  • Is there a possibility to load at least one file fully into memory and process the other file in batches? – venkata krishnan Jul 31 '19 at 09:24
  • @ChrisDoyle The files are sorted separately – krxat Jul 31 '19 at 09:30
  • @venkatakrishnan not even one file can be loaded entirely in memory – krxat Jul 31 '19 at 09:31
  • Are all numbers in file b guaranteed to be in file a? If they are then you can just read them in parallel and continue the read on file b until you get to the next number that's in file a – Sayse Jul 31 '19 at 09:31
  • Excellent, so they are in sorted order when you come to process them? Do you need to list numbers that are missing in file A from file B, and numbers missing from file B in file A? – Chris Doyle Jul 31 '19 at 09:31
  • https://fnc.readthedocs.io/en/latest/api.html#list and https://fnc.readthedocs.io/en/latest/api.html#fnc.sequences.difference this may help – ShivaGaire Jul 31 '19 at 09:32
  • @Sayse all the numbers in FileB do not necessarily need to be in FileA – krxat Jul 31 '19 at 09:32
  • @ChrisDoyle Missing numbers in FileB from FileA – krxat Jul 31 '19 at 09:34
  • @GeekSambhu I'm sorry, I don't understand completely. But if you are suggesting loading them into an array, that is not possible in this situation, as memory would overflow and a memory error would occur. – krxat Jul 31 '19 at 09:36
  • Are all the numbers in the file on the same line? Can you provide a bit more information/context? – AMC Nov 06 '19 at 03:42

5 Answers

15

If you want to read the files line by line, since you don't have much memory, and you need a linear solution, you can do this with iterators, provided your files are line based:

First in your terminal you can do this to generate some test files:

seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt

Then you run this code:

i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = next(i1, None)
b = next(i2, None)
aNotB = []
# bNotA = []
while a is not None and b is not None:
    if int(a) < int(b):
        aNotB.append(int(a))
        a = next(i1, None)
    elif int(a) > int(b):
        # bNotA.append(int(b))
        b = next(i2, None)
    else:  # equal values are present in both files, advance both
        a = next(i1, None)
        b = next(i2, None)
if b is None:
    # FileB ran out first: everything left in FileA is missing from FileB
    if a is not None:
        aNotB.append(int(a))
    aNotB += [int(line) for line in i1]
# else:
#     # FileA ran out first: the rest of FileB would go into bNotA
#     if b is not None:
#         bNotA.append(int(b))
#     bNotA += [int(line) for line in i2]
print(aNotB)

Output:

[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99]

If you want both the aNotB and bNotA results, you can uncomment the commented lines.

Timing comparison with Andrej Kesely's answer:

$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py        
python manual_iter.py  0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py        
python heapq_groupby.py  1.11s user 0.00s system 99% cpu 1.116 total
yukashima huksay
  • I guess this is one way to do it. Basically, you are using two pointers and iterating over the arrays once, so it is O(n), and since it is sorted I guess this will work. But we are still loading the whole file, aren't we? – krxat Jul 31 '19 at 10:49
  • @ArjunMuraleedharan No, we are loading the files line by line; iter lazily loads the file. – yukashima huksay Jul 31 '19 at 10:51
  • @ArjunMuraleedharan No, we aren't loading the whole file, that's the point of this algorithm – Andrej Kesely Jul 31 '19 at 10:51
  • This should be the accepted answer - it's fastest :) – Andrej Kesely Jul 31 '19 at 11:02
  • @yukashimahuksay Thank you so much for your answer. If the arrays were not sorted, what would you suggest? – krxat Jul 31 '19 at 11:51
  • @ArjunMuraleedharan I think you should post a whole new question, because that's really a whole different issue which I believe is impossible to solve in O(n) with your memory limits. – yukashima huksay Jul 31 '19 at 12:06
  • @yukashimahuksay That cannot be done in O(n), but other than sorting do you have any ideas on how to approach that problem? – krxat Jul 31 '19 at 12:30
  • @ArjunMuraleedharan You can use heuristic methods with hashing. You can also split B into k chunks, add each chunk to a set, then remove all the elements of each set from A in k iterations over A, as sketched below. – yukashima huksay Jul 31 '19 at 12:35
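
A minimal sketch of that chunked-set idea for the unsorted case, assuming line-based files; the chunk size, file names, and temp-file naming are all hypothetical:

from itertools import islice
import os

CHUNK = 100_000  # hypothetical: as many numbers as comfortably fit in memory

def chunks(path, size=CHUNK):
    # yield the file as successive sets of at most `size` integers
    with open(path) as f:
        while True:
            chunk = set(islice(map(int, f), size))
            if not chunk:
                return
            yield chunk

# k passes over a shrinking copy of FileA; each pass strikes out one FileB chunk
src = "FileA"
for i, b_set in enumerate(chunks("FileB")):
    dst = f"survivors_{i}.txt"  # hypothetical temp file
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if int(line) not in b_set:
                fout.write(line)
    if src != "FileA":
        os.remove(src)
    src = dst
# whatever remains in `src` is in FileA but not in FileB
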
7

As the files are sorted you can just iterate through them one line at a time. If the current line of file A is less than that of file B, then you know A's value is not in B, so you increment file A only and compare again. If the line in A is greater than the line in B, you know B's value is not in A, so you increment file B only. If A and B are equal, the value is in both, so you increment both files. While in your original question you stated you were interested in entries which are in A but not in B, this answer extends that and also gives entries in B not in A. This adds flexibility but still lets you print just those in A not in B.

def strip_read(file):
    return file.readline().rstrip()

in_a_not_b = []
in_b_not_a = []
# readline() returns "" at EOF, so an empty Aline/Bline marks an exhausted file
with open("fileA") as A, open("fileB") as B:
    Aline = strip_read(A)
    Bline = strip_read(B)
    while Aline or Bline:
        if Aline and (not Bline or int(Aline) < int(Bline)):
            in_a_not_b.append(Aline)
            Aline = strip_read(A)
        elif Bline and (not Aline or int(Aline) > int(Bline)):
            in_b_not_a.append(Bline)
            Bline = strip_read(B)
        else:  # equal values: present in both files, advance both
            Aline = strip_read(A)
            Bline = strip_read(B)

print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)

Output for my sample files:

in A not in B ['2', '5', '7'] 
in B not in A ['6']
Chris Doyle
  • How is this in any way a new answer? There are already two answers doing exactly this. – yukashima huksay Jul 31 '19 at 10:07
  • I only see one answer doing this by reading the files line by line, and it makes the assumption that it's only interested in the values which are in A but not in B. This code will loop through files of varying length and actually list items which are in A and not in B, and also list items which are in B but not in A – Chris Doyle Jul 31 '19 at 10:08
  • For example, in Mattias' case it will stop comparing when it reaches the end of either file, even if the other file still has entries in it, which means those entries which have not been read yet won't be listed as not being in the other file – Chris Doyle Jul 31 '19 at 10:09
  • But that is not what the OP was asking. iter also reads files line by line. – yukashima huksay Jul 31 '19 at 10:09
  • I accept the issue with Mattias' answer; that answer is wrong for files of varying length. – yukashima huksay Jul 31 '19 at 10:11
  • @yukashimahuksay Yes, sorry, your iter answer will also iterate over the files, but I am not sure of the point of your argument. My code indeed provides a different answer from any other. It offers an extension and flexibility over the OP's question, but also solves it, as it includes answers for either file missing entries. – Chris Doyle Jul 31 '19 at 10:12
  • I accept that it adds an unwanted extension; you need to edit your answer so I can remove my vote. – yukashima huksay Jul 31 '19 at 10:15
  • Thanks for pointing out my mistake, I have now corrected my answer. – Mattias Jul 31 '19 at 10:16
  • It was certainly a nice question and it's good to see it inspiring minds – Chris Doyle Jul 31 '19 at 10:18
  • You are missing the call to `int`. Otherwise you get problems like `"9" > "10"`. Also, you can open multiple files in one `with` statement. – Graipher Jul 31 '19 at 18:35
5

You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (it works as long as the files are sorted!):

FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

from itertools import groupby
from heapq import merge

# tag each value with its source file so we can tell where it came from after merging
gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

# merge() lazily interleaves the two sorted streams; groupby() gathers equal values
for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    if any(v[1] == 'FileB' for v in g):  # the value appears in FileB, so skip it
        continue
    print(v)

Prints:

2
5

EDIT (Reading from files):

from itertools import groupby
from heapq import merge

gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
    if any(v[1] == 2 for v in g):
        continue
    print(v)

Benchmark:

Generating files with 10_000_000 items:

seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt

The script takes ~10sec to complete:

real    0m10,656s
user    0m10,557s
sys 0m0,076s
Andrej Kesely
  • @AndrejKesely Thank you so much, man. But one thing I don't understand is why it takes so much time for very big files if this is also linear time complexity? – krxat Jul 31 '19 at 11:09
  • @ArjunMuraleedharan Yes, the complexity is O(n). But in my answer I create lots of tuples, one per element, for example `(int(v.strip()), 1)`. Each tuple is an object, and the creation of this object takes some "little" time. For long files, this time adds up. – Andrej Kesely Jul 31 '19 at 11:11
  • @AndrejKesely yes that makes sense. But even when I try it with some sample array, it is taking more time than the other answer. Any idea? – krxat Jul 31 '19 at 11:56
1

A simple solution based on file reading (assuming that each line holds a number):

results = []
with open('file1.csv') as file1, open('file2.csv') as file2:
    var1 = file1.readline()
    var2 = file2.readline()
    while var1:
        while var1 and var2:
            if int(var1) < int(var2):
                results.append(int(var1))
                var1 = file1.readline()
            elif int(var1) > int(var2):
                var2 = file2.readline()
            elif int(var1) == int(var2):
                var1 = file1.readline()
                var2 = file2.readline()
        if var1:
            results.append(int(var1))
            var1 = file1.readline()
print(results)

Output: [2, 5, 7, 9]
Mattias
  • You are missing the calls to `int` before the comparisons. Otherwise you get problems like `"9" > "10"`. Also, you can open multiple files in one `with` statement. – Graipher Jul 31 '19 at 18:37
  • Ok, I've fixed my code to handle that case as well. Thanks for the input! @Graipher – Mattias Aug 01 '19 at 06:04
1

This is similar to the classic treatment in Knuth's Sorting and Searching (The Art of Computer Programming, Vol. 3). You may wish to read the related Stack Overflow question, on-line lecture notes (PDF), and Wikipedia. The Stack Overflow question mentions something I agree with, which is using the unix sort command. Always, always test with your own data to ensure the method chosen is the most efficient for your data, because some of these algorithms are data dependent.
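
As an illustration of the unix-command route, a sketch with hypothetical file names. comm expects its inputs in sort's default lexicographic order, so the files are sorted that way first rather than relying on their numeric order; sort itself copes with files larger than memory by spilling to disk:

sort fileA.txt > fileA.lex
sort fileB.txt > fileB.lex
comm -23 fileA.lex fileB.lex   # print lines that appear only in fileA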

punchcard