
I have a really weird problem. I've got three files, each containing one column of numbers. I need to get ONLY the unique values from the first file, i.e. those that are not present in the second or third file.

I tried Python like:

resultfile = []
for e in firstfile:
    if e not in secondfile:
        resultfile.append(e)

And the same for the third file.

I tried uniq, sort, diff, some awk scripts and comm in the Linux shell, as suggested here: Fast way of finding lines in one file that are not in another?

But every time, the only result I get is THE SAME NUMBER OF LINES AS IN THE ORIGINAL FIRST FILE. I don't get it at all!

Maybe I've missed something? Maybe it's something to do with the format? However, I've checked it many times. Here are the files: http://dropmefiles.com/BaKGj

P.S. Later I thought there were no unique lines at all, but I checked manually, and some numbers in the first file ARE unique.

P.P.S. The format of the files is like this:

380500100000 
380500100001 
380500100002 
380500100003 
380500100004    
380500100005 
380500100008 
380500100020 
380500100022 
380500100050    
380500100070 
380500100080
tiredsys
  • If it's just one column of numbers, you might as well include 20 from each file so we understand what data you are using. Putting them on dropmefiles does not help people in the future, as the file gets removed in 7 days. I would also load the first file, then remove everything loaded from the second and third files if it exists in the first file. – IvanD Jun 24 '16 at 01:06
  • Sure, that's a good point. Are you a russophone? – tiredsys Jun 24 '16 at 01:13

3 Answers


What's wrong

And the same for the third file

If you are really doing the same for the third file, i.e. comparing the original contents of the first file with the third, you can introduce duplicates of items that were not in the second file but are in the third. For example:

file 1:
1
2
3

file 2:
1

file 3:
2

After processing file 2, resultfile would contain 2 and 3. Then after processing file 3, resultfile would contain 2 and 3 (from the first run) plus 1 and 3, i.e. 2, 3, 1, 3. However, the result should just be 3.
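That duplicate-introducing behaviour can be reproduced with a short self-contained sketch (in-memory lists stand in for the files):

```python
# Reproduce the bug: both runs filter the ORIGINAL first list,
# so items excluded by one run can reappear from the other.
first = ['1', '2', '3']
second = ['1']
third = ['2']

result = []
for e in first:          # run 1: filter against file 2
    if e not in second:
        result.append(e)
for e in first:          # run 2: filters the original first list again
    if e not in third:
        result.append(e)

print(result)  # ['2', '3', '1', '3'] -- not the expected ['3']
```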

It's not clear from your code whether you are actually writing the output of each run to the file resultfile. If you are, then you should use it as the input for the second and subsequent runs; don't process the first file again.
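A minimal sketch of that fix, chaining the passes so that each run filters the previous run's output (the list names are illustrative):

```python
first = ['1', '2', '3']
second = ['1']
third = ['2']

# Pass 1 filters the first file; pass 2 filters pass 1's OUTPUT, not the
# original first file, so items already excluded cannot reappear.
after_second = [e for e in first if e not in second]
after_third = [e for e in after_second if e not in third]

print(after_third)  # ['3']
```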


A better way to fix it

If you do not need to preserve the order of lines from the first file you could use set.difference() like this:

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(f1).difference(f2, f3)

Note that this will include any whitespace (including newline characters) present in the files. If you wanted to ignore leading and trailing whitespace from each line:

from itertools import chain

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(map(str.strip, f1)).difference(map(str.strip, chain(f2, f3)))

The above assumes Python 3. If you're using Python 2 then, optionally for better efficiency, use itertools.imap in place of map() (from itertools import imap).

Or you might like to treat the data as numeric (I'll assume float here, but you can use int instead):

from itertools import chain

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(map(float, f1)).difference(map(float, chain(f2, f3)))
mhawke
  • I see your point, but my code is more complicated than that; I just didn't want to paste all of it, so I made it a bit easier to understand. Originally, I opened the CSV files, made lists from them, then iterated over every element of the first list and stored it in the result list. Then I took the result list of the first iteration and did the same thing with the third list (file); the result was stored in another list that was later written to a fourth (resulting) CSV file. – tiredsys Jun 24 '16 at 00:40

The easiest way would be to read each file into a set, and then use Python's (very efficient) set operations to do the comparison.

file1 = set()
file2 = set()

# firstfile and secondfile are assumed to be iterables of lines
for element in firstfile:
    file1.add(element)

for element in secondfile:
    file2.add(element)

# Set difference: elements in file1 that are not in file2
unique = file1 - file2
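One caveat, given the trailing spaces and mixed line endings visible in the question's sample data: strip each line while building the sets, or otherwise identical numbers will never compare equal. A sketch with in-memory lines standing in for the files:

```python
# Lines as read from the files keep their terminators and padding.
first_lines = ['380500100000\n', '380500100004    \n']
second_lines = ['380500100000\r\n']  # CRLF-terminated, as in the question

# Strip whitespace so '380500100004    \n' and '380500100004' match.
file1 = {line.strip() for line in first_lines}
file2 = {line.strip() for line in second_lines}

unique = file1 - file2
print(unique)  # {'380500100004'}
```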
Batman
  • I tried sets like this: 1. Opened the csv files in Python with the csv module. 2. Extracted all the data from these files and transferred it to lists. 3. Made sets from those lists. 4. Tried the construction you suggested (unique = file1 - file2). Does it have the same effect, or should I try your option? – tiredsys Jun 23 '16 at 23:50
  • That will work fine. I just used that (inefficient) construction because I wasn't sure how you were reading the files into memory. – Batman Jun 23 '16 at 23:56

The issue is likely that first.csv is strictly ASCII text, while second.csv and third.csv are ASCII text with CRLF line terminators. I would suggest you convert them all to the same format (plain ASCII text would probably work best).

$ file first.csv
first.csv: ASCII text 

$ file second.csv
second.csv: ASCII text, with CRLF line terminators

$ file third.csv
third.csv: ASCII text, with CRLF line terminators
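One way to do that conversion and the comparison directly in the shell (a sketch: comm needs sorted input, and dos2unix is an alternative to tr if it's installed):

```shell
# Strip the carriage returns so all three files use plain LF endings
tr -d '\r' < second.csv > second_unix.csv
tr -d '\r' < third.csv  > third_unix.csv

# comm requires sorted input
sort first.csv > first_sorted.csv
sort second_unix.csv third_unix.csv > others_sorted.csv

# -2 suppresses lines only in the second input, -3 suppresses common
# lines, leaving only the lines unique to first.csv
comm -23 first_sorted.csv others_sorted.csv
```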
l'L'l