I have file one, which is 2.4 million lines (256 MB), and file two, which is 32 thousand lines (1.5 MB).

I need to go through file two line by line and print the matching line in file one.

Pseudocode:

open file 1, read
open file 2, read
open results, write

for line2 in file 2:
    for line1 in file 1:
        if line2 in line1:
            write line1 to results
            stop inner loop

My Code:

p = open("file1.txt", "r")
d = open("file2.txt", "r")
o = open("results.txt", "w")

for hash1 in p:
    hash1 = hash1.strip('\n')
    for data in d:
        hash2 = data.split(',')[1].strip('\n')
        if hash1 in hash2:
            o.write(data)

o.close()
d.close()
p.close()

I am expecting 32k results.

Aref

1 Answer

Your file2 is not too large, so it is perfectly fine to load it into memory.

  • Load file2.txt into a set to speed up lookups and remove duplicates;
  • Remove the empty line from the set;
  • Scan file1.txt line by line and write the matches to results.txt.

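# Load file2.txt into a set: membership tests are fast and duplicates collapse.
# Trailing newlines are stripped so a last line without "\n" still matches.
with open("file2.txt", "r") as f:
    lines = {line.rstrip("\n") for line in f}

# Drop the entry produced by empty lines, if there is one.
lines.discard("")

# Stream file1.txt and write every line that also appears in file2.txt.
with open("results.txt", "w") as o, open("file1.txt", "r") as f:
    for line in f:
        if line.rstrip("\n") in lines:
            o.write(line)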
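
The code above compares whole lines, while the code in the question compares a bare value from one file against the second comma-separated field of the other. A minimal sketch of that variant follows. It assumes the smaller file holds the comma-separated records, that the match is exact equality rather than the substring test used in the question, and that one output line per matching record is wanted. The function name, its parameters and the field index default are hypothetical.

```python
from collections import defaultdict


def write_matching_records(hashes_path, records_path, out_path, field=1):
    """Write records whose `field` column equals a line of hashes_path.

    Assumption: records_path is the smaller file and fits in memory.
    Returns the number of records written.
    """
    # Index the records by the value in the chosen column.
    by_hash = defaultdict(list)
    with open(records_path, "r") as records:
        for record in records:
            record = record.rstrip("\n")
            parts = record.split(",")
            if len(parts) > field:
                by_hash[parts[field].strip()].append(record)

    written = 0
    with open(hashes_path, "r") as hashes, open(out_path, "w") as out:
        # Stream the large file; only one line is held at a time.
        for line in hashes:
            key = line.strip()
            if not key:
                continue  # skip blank lines
            # pop() so a value repeated in the large file is written once.
            for record in by_hash.pop(key, ()):
                out.write(record + "\n")
                written += 1
    return written
```

With the file names from the question this might be called as `write_matching_records("file1.txt", "file2.txt", "results.txt")`. Each file is read only once, unlike the nested loops in the question.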

If file2 were larger, we could split it into chunks and repeat the same process for every chunk, but in that case it would be harder to compile the results together.
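
The chunked approach could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name, its parameters and the default chunk size are hypothetical, and matching is by whole line, as in the code above. Note that the output ends up grouped by chunk rather than in the order of the scanned file, which is one reason the results are harder to compile.

```python
from itertools import islice


def match_in_chunks(keys_path, data_path, out_path, chunk_size=100_000):
    """Write lines of data_path that also occur in keys_path.

    Only chunk_size lines of keys_path are held in memory at a time;
    data_path is rescanned once per chunk. Returns the number of
    lines written.
    """
    written = 0
    with open(keys_path, "r") as keys, open(out_path, "w") as out:
        while True:
            # Take the next chunk_size lines from the keys file.
            raw = list(islice(keys, chunk_size))
            if not raw:
                break  # keys file exhausted
            chunk = {line.rstrip("\n") for line in raw}
            chunk.discard("")  # ignore empty lines
            if not chunk:
                continue
            # Rescan the data file against this chunk only.
            with open(data_path, "r") as data:
                for line in data:
                    value = line.rstrip("\n")
                    if value in chunk:
                        out.write(value + "\n")
                        written += 1
    return written
```

A key that appears in more than one chunk would produce repeated output lines, so duplicates would have to be removed in a separate pass if that matters.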

Sergey Nudnov