Not sure if it's too late already, but here it comes.
I see you are loading two full files into two in-memory arrays. If they are about 3 GB each, that means trying to fit 6 GB into RAM, probably pushing the machine into swap.
Furthermore, even if the files do load, you are then attempting roughly L1×L2 string comparisons (L1 and L2 being the line counts of the two files).
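A set makes each membership test effectively constant time, so the whole diff costs on the order of L1 + L2 operations instead of L1 × L2. A minimal sketch of the idea, with made-up lines standing in for the two log files:

```python
# Hypothetical sample lines, standing in for the two log files
old_lines = {'alpha\n', 'beta\n', 'gamma\n'}   # file 1, loaded into a set
new_file = ['beta\n', 'delta\n']               # file 2, streamed line by line

# One O(1) set lookup per line replaces a scan over all of file 1
updates = [line for line in new_file if line not in old_lines]
print(updates)  # ['delta\n']
```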
I have run the following code on a 1.2 GB file (3.3 million lines) and it completes in seconds. It uses string hashes, and only holds in RAM a set of L1 32-bit integers.
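To get a feel for the savings (exact numbers vary by Python version, and the sample lines below are made up), compare the footprint of the hashes against the lines themselves:

```python
import sys
from zlib import adler32

# Hypothetical log lines, roughly the shape of an haproxy entry
lines = [f'192.168.0.{i % 256} GET /page/{i} 200\n' for i in range(100_000)]

hashes = set(adler32(l.encode('utf-8')) for l in lines)

# Bytes held if we kept the strings vs. only their 32-bit hashes
string_bytes = sys.getsizeof(set(lines)) + sum(sys.getsizeof(l) for l in lines)
hash_bytes = sys.getsizeof(hashes) + sum(sys.getsizeof(h) for h in hashes)
print(hash_bytes < string_bytes)  # the hash set is a fraction of the size
```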
The trick is done here: a set() is built by applying the hashstring function to every line in the file (except the header, which you seem to be adding to the output):
    file1 = set(map(hashstring, f1))
Please note I am comparing the file against itself (f2 reads the same file as f1). Let me know if it helps.
    from zlib import adler32

    def hashstring(s):
        # Adler-32 gives a fast 32-bit checksum of the line
        return adler32(s.encode('utf-8'))

    with open('haproxy.log.1', 'r') as f1:
        heading = f1.readline()
        print(f'Heading: {heading}')
        print('Hashing')
        # Hash every remaining line of the first file into a set of integers
        file1 = set(map(hashstring, f1))
        print(f'Hashed: {len(file1)}')

    with open('updates.log', 'w') as outFile:
        count = 0
        outFile.write(heading)
        with open('haproxy.log.1', 'r') as f2:
            f2.readline()  # skip the header: it has already been written
            for line in f2:
                # Keep only the lines whose hash was not seen in the first file
                if hashstring(line) not in file1:
                    outFile.write(line)
                    count += 1
                    if 0 == count % 10000:
                        print(f'Checked: {count}')
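One caveat worth knowing: Adler-32 is only 32 bits and is weak on short inputs, so two different lines can hash to the same value, in which case a genuinely new line would be silently skipped. A small demonstration, plus a sketch of a stronger alternative using the standard library's hashlib (the hashstring64 name is mine, not part of the code above):

```python
from zlib import adler32
from hashlib import blake2b

# Two different short strings with the same Adler-32 checksum
print(adler32(b'BAC') == adler32(b'C?D'))  # True, yet the strings differ

# If collisions matter, a larger digest keeps memory modest; for example,
# the first 8 bytes of a BLAKE2b digest per line:
def hashstring64(s):
    return blake2b(s.encode('utf-8'), digest_size=8).digest()
```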