I don't think there is a better or faster way to do this than what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from `small.txt` in a set and iterating the lines in `big.txt`, checking whether they are in that set, will have a complexity of O(b), with b being the number of lines in `big.txt`.
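In code, that first approach might look roughly like this (a minimal sketch; the file names `small.txt` and `big.txt` are from the question, the output name `big_filtered.txt` is an assumption):

    with open("small.txt") as f:
        to_remove = set(line.strip() for line in f)

    # Stream big.txt once, keeping only the lines that are not in the set.
    with open("big.txt") as fin, open("big_filtered.txt", "w") as fout:
        for line in fin:
            if line.strip() not in to_remove:
                fout.write(line)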
What you seem to be trying is to reduce this to O(s*log b), with s being the number of lines in `small.txt`, by using binary search to check for each line in `small.txt` whether it is in `big.txt`, and then removing/overwriting it.
This would work well if all the lines were in a list with random access to any element, but you just have the file, which does not allow random access to any line. It does, however, allow random access to any character with `file.seek`, which (at least in some cases?) seems to be O(1). But then you still have to find the previous line break before that position before you can actually read that line. Also, you cannot just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.
So, yes, theoretically it can be done in O(s*log b), if you do the following:
- implement binary search, searching not on the lines, but on the characters of the big file
- for each position, backtrack to the last line break, then read the line to get the number
- try again in the lower/upper half as usual with binary search
- if the number is found, replace with as many spaces as there are digits in the number
- repeat with the next number from the small file
On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with `fileinput.input` and `print`. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
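For reference, the measured `fileinput`/`print` rewrite could look something like this (a sketch; the `inplace=True` in-place rewrite and the blank-line behaviour are assumptions, not part of the measurement above):

    import fileinput

    with open("small.txt") as f:
        to_remove = set(line.strip() for line in f)

    # With inplace=True, stdout is redirected into big.txt while iterating,
    # so every print() call writes one line of the rewritten file.
    for line in fileinput.input("big.txt", inplace=True):
        if line.strip() in to_remove:
            print()              # replace the number with an empty line
        else:
            print(line, end="")  # keep the line unchanged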
Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with an accordant number of `-` characters (not just a blank line; that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors, etc.
import os

def getlineat(f, pos):
    # Backtrack from pos to the previous line break (or the start of the
    # file) and return the position of the first character of that line.
    # Also leaves the file cursor there, so the next readline() reads it.
    pos = f.seek(pos)
    while pos > 0 and f.read(1) != "\n":
        pos = f.seek(pos-1)
    return pos+1 if pos > 0 else 0

def bsearch(f, num):
    # Binary search on the characters of the file: pick the middle
    # character, backtrack to the start of its line, compare that number.
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line: break  # end of file
        val = int(line)
        if val == num:
            # Found: return the line's position and its number of digits.
            return (pos, len(line.strip()))
        elif num < val:
            upper = mid - 1
        elif num > val:
            lower = mid + 1
    return (-1, -1)

def overwrite(filename, to_remove):
    with open(filename, "r+") as f:
        # First collect all positions, then overwrite in a second step,
        # so the search never hits an already-overwritten line.
        positions = [bsearch(f, n) for n in to_remove]
        for n, (pos, length) in sorted(zip(to_remove, positions)):
            print(n, pos)  # debug output: number and position (-1 if absent)
            if pos != -1:
                f.seek(pos)
                f.write("-" * length)

import random
to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)
This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search would have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed, and it worked just fine.
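A sorted test file like the one described above can be generated like this (a sketch; the name `test.txt` matches the usage example):

    with open("test.txt", "w") as f:
        f.writelines(f"{i}\n" for i in range(1001))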
Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*log b), at least on my system (the complexity of `file.seek` may depend on the file system, file type, etc.).
The `bsearch` function could also be generalized to accept another parameter `value_function` instead of hardcoding `val = int(line)`. Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, CSV files, etc., as long as the lines are sorted by that same value function.
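Such a generalization might look roughly like this (a sketch reusing `os` and `getlineat` from above; the parameter name `value_function` is the one suggested here, the rest is an assumption):

    def bsearch(f, key, value_function=int):
        # Same character-based binary search as above, but the value to
        # compare is extracted from each line by an arbitrary function.
        lower = 0
        upper = os.stat(f.name).st_size - 1
        while lower <= upper:
            mid = (lower + upper) // 2
            pos = getlineat(f, mid)
            line = f.readline()
            if not line:
                break  # end of file
            val = value_function(line)
            if val == key:
                return (pos, len(line.strip()))
            elif key < val:
                upper = mid - 1
            else:
                lower = mid + 1
        return (-1, -1)

    # For example, for a CSV file sorted by its first column:
    # bsearch(f, "smith", value_function=lambda line: line.split(",")[0])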