There are two files, say FileA and FileB and we need to find all the numbers that are in FileA which is not there in FileB. All the numbers in the FileA are sorted and all the numbers in FileB are sorted. For example,
Input:
FileA = [1, 2, 3, 4, 5, ...]
FileB = [1, 3, 4, 6, ...]
Output:
[2, 5, ...]
The memory is very limited and even one entire file cannot be loaded into memory at a time. Also linear or lesser time complexity is needed.
So if the files are small enough to fit in the memory, we could load them and initialize its contents as two sets and then take a set difference so that the problem is solved in O(1) or constant time complexity.
set(contentsofFileA)-set(contentsofFileB)
But since the files are so big, they won't be able to load entirely into the memory and so this is not possible.
Also, another approach would be to use a brute force method with batch processing. So, we load a chunk or batch of data from FileA and then a batch from FileB and then compare it and then the next chunk from FileB and so on. Then after the FileA chunk is checked over all the elements in FileB then load the next batch from FileA and this continues. But this would create an O(n^2) or quadratic time complexity and not efficient for a very large file with large entries.
The problem is required to be solved in linear or lesser time complexity and without loading the entire files into memory. Any help?