
I have two 3 GB text files, each with around 80 million lines, and they share 99.9% identical lines (file A has 60,000 unique lines, file B has 80,000 unique lines).

How can I quickly find the unique lines in these two files? Are there any ready-to-use command-line tools for this? I'm using Python, but I suspect it will be hard to find an efficient Pythonic way to load the files and compare them.

Any suggestions are appreciated.

jack
  • Do you mean that 99.9% of the *files* are identical, or that 99.9% of the *lines* are identical (ie the same line is repeated)? – bstpierre Aug 23 '10 at 02:52
  • Do you care about the order of the lines? Does B have all the lines of A in the same order as A? Can there be reordering or deletion of lines? Are there repeated lines whose count matters (A has a line n times, B has it n−b times → the difference is b copies of that line)? – Tony Veijalainen Aug 23 '10 at 04:50
  • If you ask about "ready-to-use command line tools", you might want to specify an OS. On most, "diff" is either native or ported. Still, I can't be sure what you want from your question: perhaps on Linux: sort --unique < file1 > uniq1; sort --unique < file2 > uniq2; diff uniq[12]. – Tony Delroy Aug 23 '10 at 05:08
  • How many bytes per line on average? – Daniel Stutzbach Aug 23 '10 at 12:39
  • @bstpierre, exactly, 99.9% of the lines in the two files are identical, but the unique lines are randomly spread across the two files. – jack Aug 23 '10 at 15:18
  • @Tony Veijalainen, all lines are in random order. I just hope to find all unique lines in file A and all unique lines in file B; any order is ok. There are no repeated lines in either file. – jack Aug 23 '10 at 15:20
  • @Daniel Stutzbach, Average line length is 35 bytes. – jack Aug 23 '10 at 15:25

5 Answers


If order matters, try the comm utility. If order doesn't matter, sort file1 file2 | uniq -u.
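
If you'd rather drive that from Python, here is a rough sketch of the second suggestion; it assumes a Unix-like system with sort and uniq on the PATH, and unique_lines.txt is just a placeholder output name:

# Rough sketch: run the suggested pipeline from Python.
# Lines that appear in only one of the two files end up in unique_lines.txt.
import subprocess

subprocess.check_call("sort file1 file2 | uniq -u > unique_lines.txt", shell=True)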

zwol

I think this is the fastest method (whether it's in Python or another language shouldn't matter too much IMO).

Notes:

1. I only store each line's hash to save space (and time, if paging might occur).

2. Because of the above, I only print out line numbers; if you need the actual lines, you'd just need to read the files in again.

3. I assume that the hash function produces no collisions. This is nearly, but not perfectly, certain.

4. I import hashlib because the built-in hash() function produces values with too few bits to avoid collisions.

import sys
import hashlib

# Note: this relies on Python 3, where dict .keys() views support set difference.
files = []
lines = []
for i in range(2):
    # open the files named on the command line
    files.append(open(sys.argv[1 + i], 'r'))
    # maps the hash of each line in file i to "filename: line number"
    lines.append({})
    # assuming you like counting lines starting with 1
    counter = 1
    while True:
        # assuming the default encoding is sufficient to handle the input file
        line = files[i].readline().encode()
        if not line:
            break
        hashcode = hashlib.sha512(line).hexdigest()
        lines[i][hashcode] = sys.argv[1 + i] + ': ' + str(counter)
        counter += 1
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
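
As a follow-up to note 2, here is a rough sketch of recovering the actual text of the differing lines by re-reading the files, assuming result holds strings of the form "filename: line number" as built above:

# Rough sketch: re-read the files and print only the lines named in result.
wanted = {}
for entry in result:
    name, _, lineno = entry.rpartition(': ')
    wanted.setdefault(name, set()).add(int(lineno))

for name, numbers in wanted.items():
    with open(name, 'r') as f:
        for lineno, line in enumerate(f, start=1):
            if lineno in numbers:
                print(name, lineno, line, end='')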
max
  • Looks like a good answer to me; I would only suggest saving the seek position of each line while reading, so the lines can be recovered quickly for the result. – Tony Veijalainen Aug 23 '10 at 17:36

With 60,000 or 80,000 unique lines you could just create a dictionary entry for each unique line, mapping it to a number: mydict["hello world"] => 1, etc. If your average line is around 40-80 characters, this will be in the neighborhood of 10 MB of memory.

Then read each file, converting it to an array of numbers via the dictionary. At roughly 80 million lines and 8 bytes per entry, each array is on the order of several hundred megabytes, which should still fit in memory. Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.

EDIT:

In response to your comment, here's a sample script that assigns numbers to unique lines as it reads from a file.

#!/usr/bin/python

class Reader:

    def __init__(self, file):
        self.count = 0
        self.dict = {}   # maps each distinct line to the number assigned to it
        self.file = file

    def readline(self):
        line = self.file.readline()
        if not line:
            return None
        if line in self.dict:
            return self.dict[line]
        else:
            self.count += 1
            self.dict[line] = self.count
            return self.count

if __name__ == '__main__':
    print("Type Ctrl-D to quit.")
    import sys
    r = Reader(sys.stdin)
    result = 'ignore'
    while result:
        result = r.readline()
        print(result)
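
Here is also a minimal sketch of that idea applied to the original two-file question, assuming a single shared line-to-number mapping is used for both files (fileA and fileB are placeholder names); it is a variation on, not a drop-in use of, the Reader class above:

# Minimal sketch: share one line -> number mapping across both files,
# then compare the resulting sets of numbers.
mapping = {}  # line text -> assigned number

def numbers(path):
    seen = set()
    with open(path) as f:
        for line in f:
            if line not in mapping:
                mapping[line] = len(mapping) + 1
            seen.add(mapping[line])
    return seen

a = numbers('fileA')
b = numbers('fileB')

# invert the mapping to recover the text of the lines unique to each file
inverse = {number: text for text, number in mapping.items()}
print([inverse[n] for n in a - b])  # lines only in fileA
print([inverse[n] for n in b - a])  # lines only in fileB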
Harold L
  • @Harold L, I'm confused. How can I map 60,000 or 80,000 unique lines to a dictionary before knowing what lines are contained in both files? – jack Aug 23 '10 at 03:46
  • You can just build the dictionary as you read the files. I'll add code for a helper function above. – Harold L Aug 23 '10 at 06:37
  • dict.keys() with 3 GB? I do not believe that you can store only the hash with self.dict[line]; it saves the whole line in the keys, plus the hashes. – Tony Veijalainen Aug 23 '10 at 17:43
  • @Tony Veijalainen, Yes, the dict will save the whole lines, but it will only save each line once. So this technique works well here only because Jack has many duplicate lines: 3GB might be 100 million lines of text, say, but only 80,000 unique lines will be stored in the dictionary's key set. – Harold L Aug 23 '10 at 18:08
  • "There are no repeated lines in both files". See the poster's comment to his post in answer to me. Maybe I do not understand his English properly. – Tony Veijalainen Aug 23 '10 at 19:29
  • Unfortunately, based on the clarification from OP (in response to Tony), the original files do not have repeated lines. The "99.9% identical lines" only refers to the fact that each of those lines is present in both files. With that clarification, your approach won't work, unfortunately. – max Aug 24 '10 at 21:33

If I understand correctly, you want the lines of these files without duplicates. This does the job:

uniqA = set(open('fileA', 'r'))
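
To get what the question actually asks for, the lines unique to each file, the same idea extends to a set difference. A rough sketch, assuming both files fit in memory at once:

# Rough sketch: set difference between the two files' line sets.
# Note: this holds both ~3 GB files in memory at the same time.
with open('fileA') as fa, open('fileB') as fb:
    linesA = set(fa)
    linesB = set(fb)

only_in_A = linesA - linesB  # lines that appear only in fileA
only_in_B = linesB - linesA  # lines that appear only in fileB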

Python has difflib, which claims to be quite competitive with other diff utilities; see http://docs.python.org/library/difflib.html
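
For completeness, a minimal sketch of what that might look like; it reads both files fully into memory, which, as the comments below point out, is probably impractical at 3 GB:

# Minimal difflib sketch (reads both files into memory).
import difflib

with open('fileA') as fa, open('fileB') as fb:
    diff = list(difflib.unified_diff(fa.readlines(), fb.readlines(),
                                     fromfile='fileA', tofile='fileB'))

for line in diff:
    print(line, end='')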

Tony Veijalainen
  • Can this lib handle 3 GB text files?! Even good databases have a hard time with this kind of task... They need indexing and other optimizations to get the result in reasonable time. – Asaf Aug 23 '10 at 12:11
  • As the lines are in random order and there is no need to find changed lines, this is probably not the best approach. It would be appropriate if the two files were versions of the same file (which seemed possible given the high similarity between their lines). – Tony Veijalainen Aug 23 '10 at 17:38