Comparison with Python

Question

I have two files (one old, one new) with the same structure. I need to compare them and return the data that is unique to the new file.

Each file is TAB delimited and looks something like this (each about 16k lines long):

8445    200807
8345    200807
etc.    etc.

I have a basic understanding of doing comparisons using a loop, but I'm not sure how to compare corresponding columns of data against 2 other corresponding columns.

EDIT: Sorry, there is some confusion about what I want as the result. So if I have this as my old file:

8445    200807
8345    200807

And this is my new file:

8445    200807
8445    200809

I want the script to return:

8445    200809

So the pair has to be unique to the new file. If that makes sense.

What should the comparison do? And what is the expected result, a list of common records? — Fred Foo, Aug 14 '13 at 15:06
do you really need to use python? diff seems well suited already (or perhaps `diff | sort | uniq`) — Bonlenfum, Aug 14 '13 at 15:08
you can read using `numpy.loadtxt` and then use the [procedure explained here](http://stackoverflow.com/q/16970982/832621) to keep only the unique rows... — Saullo G. P. Castro, Aug 14 '13 at 15:08
"kick out" can mean either "remove" or "return", unfortunately.. — DSM, Aug 14 '13 at 15:20
@Marcin I need the unique pairs from the new file to return. — m0ngr31, Aug 14 '13 at 15:25
@m0ngr31 Both `8445 200807` and `8445 200809` are unique pairs, but you only want to return one. Why is that? — Marcin, Aug 14 '13 at 15:26
@Marcin: 8445 200807 is in the old file as well, so I don't need it in the new one. — m0ngr31, Aug 14 '13 at 15:29
However, what part of this task has you stumped? What have you tried so far? — Marcin, Aug 14 '13 at 15:34
I'm just not sure how to do it. I am not the world's best python expert in the world if you know what I mean. — m0ngr31, Aug 14 '13 at 15:46
So `comm -13 file1 file2` in Python, basically (and relaxing the constraint that `comm` requires sorted input)? — tripleee, Aug 14 '13 at 15:59

Hans Then · Answer 1 · 2013-08-14T15:46:33.713

2

This is the most straightforward way I can think of. Purists will probably complain it does not use a with statement, so be warned.

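def compare_files():
    f1 = open('old')
    f2 = open('new')

    # Collect every line of the old file for fast membership tests.
    d1 = set()

    for line in f1:
        d1.add(line)

    # Yield each line of the new file that never appears in the old file.
    for line in f2:
        if line not in d1:
            yield line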

And use it like this:

for line in compare_files():
    # Each line keeps its own newline, so suppress the one print would add.
    print("not in old", line, end="")
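For anyone who prefers the with statement, the same idea can be sketched as below. The function name `unique_to_new` and its two path parameters are my own placeholders, not part of the code above. It also strips line endings before comparing, on the assumption that a file whose last line lacks a trailing newline should still match.

```python
def unique_to_new(old_path, new_path):
    """Yield records of new_path that do not appear anywhere in old_path."""
    with open(old_path) as old_file:
        # Strip line endings so a missing final newline cannot cause a false mismatch.
        seen = {line.rstrip("\r\n") for line in old_file}

    with open(new_path) as new_file:
        for line in new_file:
            record = line.rstrip("\r\n")
            if record not in seen:
                yield record
```

Because the records come back without their newline, a plain `print(record)` is enough when looping over the result. The new file stays open until the generator has been consumed.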

edited Aug 14 '13 at 15:46

answered Aug 14 '13 at 15:11

Hans Then


I updated my question to be more clear about what I'm trying to get out of it. – m0ngr31 Aug 14 '13 at 15:19
I have updated my answer, to match your comments, I hope. – Hans Then Aug 14 '13 at 15:48

score 0 · Answer 2 · answered Aug 14 '13 at 15:14

I'm going to guess what you want: a set of rows that are common to both files. That's the intersection of the two files, i.e.

with open("file1") as f1, open("file2") as f2:
    # Lists are unhashable, so store each row as a tuple.
    rows1 = set(tuple(ln.split()) for ln in f1)
    rows2 = set(tuple(ln.split()) for ln in f2)

    for row in rows1 & rows2:
        print("\t".join(row))

This changes the order of the rows, though. If you want the lines that only occur in the first file, then replace & with -.
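If the order matters, one option is to keep a set only for the lookup and walk the second file in sequence. The following is a rough sketch under my own assumptions: the helper name `rows_only_in_new` is hypothetical, it takes any iterables of lines (open files work too), and it skips blank lines and repeated rows.

```python
def rows_only_in_new(old_lines, new_lines):
    """Return rows found only in new_lines, in their original order."""
    # Tuples are hashable, so each split row can be stored in a set.
    old_rows = {tuple(line.split()) for line in old_lines}

    result = []
    emitted = set()
    for line in new_lines:
        row = tuple(line.split())
        # Skip blank lines, rows present in the old data, and repeats.
        if row and row not in old_rows and row not in emitted:
            emitted.add(row)
            result.append(row)
    return result


old = ["8445\t200807\n", "8345\t200807\n"]
new = ["8445\t200807\n", "8445\t200809\n"]
for row in rows_only_in_new(old, new):
    print("\t".join(row))
```

With the sample data from the question this leaves the single row `8445`, `200809`. Splitting on whitespace means a row matches even if one file uses tabs and the other uses spaces, which may or may not be what you want.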

Fred Foo


I updated my question to be more clear about what I'm trying to get out of it. – m0ngr31 Aug 14 '13 at 15:19
@m0ngr31: that's `rows2 - rows1`. – Fred Foo Aug 14 '13 at 18:00
