
I have two files (one old, one new) with the same structure. I need to compare them and return the data that is unique to the new file.

Each file is TAB delimited and looks something like this (each about 16k lines long):

8445    200807
8345    200807
etc.    etc.

I have a basic understanding of doing comparisons using a loop, but I'm not sure how to compare a pair of columns in one file against the corresponding pair of columns in the other.

EDIT: Sorry, there was some confusion about what I want as the result. If I have this as my old file:

8445    200807
8345    200807

And this is my new file:

8445    200807
8445    200809

I want the script to return:

8445    200809

So the pair has to be unique to the new file. If that makes sense.

m0ngr31

2 Answers


This is the most straightforward way I can think of. Purists will probably complain that it does not use a with statement, so be warned.

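def compare_files():
    # The files are never closed explicitly (no with statement).
    f1 = open('old')
    f2 = open('new')

    # Collect every line of the old file for fast membership tests.
    d1 = set()

    for line in f1:
        d1.add(line)

    # Yield each line of the new file that never appears in the old one.
    for line in f2:
        if line not in d1:
            yield line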

And use it like this:

for line in compare_files():
    # Each line keeps its own newline, so suppress the one print adds.
    print("not in old", line, end="")
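For anyone who does want the with statement, here is a minimal sketch of the same idea. The function name new_only_lines and its two path parameters are my own placeholders, not part of the answer above. It also strips the trailing newline, on the assumption that a last line without a newline should still match.

```python
def new_only_lines(old_path, new_path):
    """Yield lines of new_path that do not occur anywhere in old_path."""
    # Hypothetical helper: the file names are parameters, not hard-coded.
    with open(old_path) as old_file:
        # Strip the newline so a final line without one still compares equal.
        seen = {line.rstrip("\n") for line in old_file}

    with open(new_path) as new_file:
        for line in new_file:
            line = line.rstrip("\n")
            if line not in seen:
                yield line
```

It is used the same way, for example: for line in new_only_lines('old', 'new'): print(line). Both files are closed as soon as their with blocks exit, and the new file is still read lazily, one line at a time.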
Hans Then

I'm going to guess what you want: a set of rows that are common to both files. That's the intersection of the two files, i.e.

with open("file1") as f1, open("file2") as f2:
    # Lists are unhashable, so turn each split row into a tuple.
    rows1 = set(tuple(ln.split()) for ln in f1)
    rows2 = set(tuple(ln.split()) for ln in f2)

    for row in rows1 & rows2:
        print("\t".join(row))

This changes the order of the rows, though. If you want the lines that only occur in the first file, then replace & with -.
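Applied to the example in the question's edit, the difference would run from the new file to the old one. Below is a rough sketch that also keeps the new file's row order. The function name rows_unique_to_new is made up for illustration, and it takes any iterables of lines (open files or lists), which is an assumption on my part.

```python
def rows_unique_to_new(old_lines, new_lines):
    """Return the rows (as tuples) that occur in new_lines but not in old_lines."""
    # Tuples are hashable, so each split row can be stored in a set.
    old_rows = {tuple(ln.split()) for ln in old_lines}

    unique = []
    for ln in new_lines:
        row = tuple(ln.split())
        # Skip blank lines; keep rows in the order they appear in the new file.
        if row and row not in old_rows:
            unique.append(row)
    return unique


# Sample data copied from the question's edit.
old = ["8445\t200807\n", "8345\t200807\n"]
new = ["8445\t200807\n", "8445\t200809\n"]

for row in rows_unique_to_new(old, new):
    print("\t".join(row))  # the pair 8445 / 200809
```

A row that is repeated in the new file would be reported once per occurrence. If that matters, keep a second set of rows already emitted.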

Fred Foo