How to compare 2 txt files in Python

Question

I have written a program to compare new1.txt with new2.txt. The lines that are in new1.txt but not in new2.txt have to be written to difference.txt.

Can someone please have a look and let me know what changes are required in the code below? It writes the same line multiple times.

file1 = open("new1.txt",'r')        
file2 = open("new2.txt",'r')    
NewFile = open("difference.txt",'w')   
for line1 in file1:    
    for line2 in file2:    
        if line2 != line1:    
            NewFile.write(line1)    
file1.close()    
file2.close()
NewFile.close()

If you add some `print`s in, you will see the mistake you have made... — jonrsharpe, Jul 30 '15 at 10:38
Are the lines in your files in order? Are either of the files very long (e.g. too long to keep all in memory at once)? — Blckknght, Jul 30 '15 at 10:39
You should have a look to [`filecmp`](https://docs.python.org/2/library/filecmp.html) and [`difflib`](https://docs.python.org/2/library/difflib.html) — clemtoy, Jul 30 '15 at 10:45
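
As a rough illustration of the `difflib` suggestion in the comment above, the sketch below diffs two small in-memory line lists (hypothetical stand-ins for the contents of new1.txt and new2.txt) and keeps the lines reported as present only in the first. Note that `difflib` compares sequences position-aware, so for reordered files it may report different lines than a pure membership test would.

```python
import difflib

# Hypothetical in-memory stand-ins for the lines of new1.txt and new2.txt
lines1 = ["apple\n", "banana\n", "cherry\n"]
lines2 = ["apple\n", "cherry\n", "date\n"]

# ndiff prefixes each output line with "- " (only in the first sequence),
# "+ " (only in the second), "  " (in both) or "? " (intraline hints)
delta = list(difflib.ndiff(lines1, lines2))

# Keep the lines that the diff reports as present only in the first sequence
only_in_first = [entry[2:] for entry in delta if entry.startswith("- ")]
print(only_in_first)
```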

Markus Meskanen · Accepted Answer · 2015-07-30T11:11:38.300

Here's an example using the `with` statement, supposing the files are not too big to fit in the memory:

# Open 'new1.txt' as f1, 'new2.txt' as f2 and 'diff.txt' as outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:

    # Read the lines from 'new2.txt' and store them into a python set
    lines = set(f2.readlines())

    # Loop through each line in 'new1.txt'
    for line in f1:

        # If the line was not in 'new2.txt'
        if line not in lines:

            # Write the line to the output file
            outf.write(line)

The `with` statement simply closes the opened file(s) automatically. These two pieces of code are equivalent, as long as no exception is raised:

with open('temp.log', 'w') as temp:
    temp.write('Temporary logging.')

# equivalent to:

temp = open('temp.log', 'w')
temp.write('Temporary logging.')
temp.close()
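
One detail the plain `open()`/`close()` version does not cover is what happens when an exception is raised before `close()` is reached. A closer approximation of what `with` does is a `try`/`finally` block, sketched below. The scratch path is a hypothetical temporary location chosen so the example does not touch real data.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not overwrite anything real
path = os.path.join(tempfile.mkdtemp(), "temp.log")

# Version 1: the with statement closes the file when the block ends
with open(path, "w") as temp:
    temp.write("Temporary logging.")
closed_after_with = temp.closed

# Version 2: roughly what with does behind the scenes
temp2 = open(path, "w")
try:
    temp2.write("Temporary logging.")
finally:
    # Runs whether or not write() raised an exception
    temp2.close()
closed_after_finally = temp2.closed

with open(path) as check:
    content = check.read()
```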

Yet another way is to use two sets, but this again isn't very memory efficient. If your files are big, this won't work:

# Again, open the three files as f1, f2 and outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:

    # Read the lines in 'new1.txt' and 'new2.txt'
    s1, s2 = set(f1.readlines()), set(f2.readlines())

    # `s1 - s2 | s2 - s1` returns the differences between two sets (same as `s1 ^ s2`)
    # Now we simply loop through the different lines
    for line in s1 - s2 | s2 - s1:

        # And output all the different lines
        outf.write(line)

Keep in mind that this last snippet might not preserve the order of your lines, because sets are unordered.
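
If the order matters, one possible variation (a sketch, not part of the original answer) is to keep the ordered line lists and use the sets only for membership checks. The `io.StringIO` objects below are hypothetical stand-ins for the two opened files.

```python
import io

# Hypothetical in-memory stand-ins for new1.txt and new2.txt
f1 = io.StringIO("a\nb\nc\nd\n")
f2 = io.StringIO("b\nd\ne\n")

lines1 = f1.readlines()
lines2 = f2.readlines()
s1, s2 = set(lines1), set(lines2)

# Walk the ordered lists and use the sets only for fast lookups
only_in_1 = [line for line in lines1 if line not in s2]
only_in_2 = [line for line in lines2 if line not in s1]

# Lines unique to the first file, then lines unique to the second
ordered_diff = only_in_1 + only_in_2
```

Unlike the set version, this keeps repeated lines if a unique line occurs more than once in its file.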

edited Jul 30 '15 at 11:11

answered Jul 30 '15 at 10:46

Markus Meskanen

@VigneshKalai Which is why it says "supposing the files are not too big to fit in the memory" in my answer. If his file2 is too big, this is not the correct answer for him. – Markus Meskanen Jul 30 '15 at 10:48
Sorry did not see that :P – The6thSense Jul 30 '15 at 10:49
@Markus, the code you gave me worked for me, but I have a few doubts: when would we need to use `with`, and what exactly does that line of your code do? – Maverick Jul 30 '15 at 10:55
@Maverick Look at the second code in my answer, that's where I explain what `with` does. It simply closes the file automatically, so personally I'd use `with` every time I open a file. You can still use the old `file = open('bla'); file.close()` method, but I think `with` is easier and it makes sure you never forget to close the file. – Markus Meskanen Jul 30 '15 at 10:57
@markus, which means if `with` is used we don't need to add a line to close the file? – Maverick Jul 30 '15 at 11:02
@Maverick Yes, exactly. – Markus Meskanen Jul 30 '15 at 11:05
@Maverick In addition, the `with` statement will close your file even in the case of an un-handled exception. – 301_Moved_Permanently Jul 30 '15 at 11:06
@MarkusMeskanen, I still don't understand the line `lines = set(f2.readlines())`. Does it store all the lines in `lines`? – Maverick Jul 30 '15 at 11:20
`f2.readlines()` reads all the lines from `f2` and stores them in a list. A set is then created from this list. – Dyrborg Jul 30 '15 at 11:22
@Maverick `f2.readlines()` simply reads all the lines in `f2` and returns them as a list. Now we convert the returned list to a `set` simply because sets are faster for comparison than lists. Finally we store the set into `lines` variable. So the variable `lines` now contains all the lines of your file `'new2.txt'`. – Markus Meskanen Jul 30 '15 at 11:22
@markus, in the code above, can we write `lines = f2.readlines()` instead of `lines = set(f2.readlines())`? I tested it and both give the same result. – Maverick Jul 30 '15 at 18:53
@Maverick Yes, of course! :) It's simply faster to check if an element is in a set than it is to check if the element is in a list. But lists work just fine if you're not worried about performance :) In a program like this, the difference is just milliseconds, so you won't notice any. – Markus Meskanen Jul 30 '15 at 19:42
thnx again Markus for the explanation – Maverick Jul 31 '15 at 05:31

score 1 · Answer 2 · answered Jul 30 '15 at 10:41

For example, suppose file1 contains: line1 line2

and file2 contains: line1 line3 line4

When you compare line1 with line3, you write line1 to your output file. Then you compare line1 with line4; again they are not equal, so you write line1 to the output file again... You need to break out of both `for` loops when your condition is true. You can use a helper variable to break out of the outer `for`.
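
The duplication described above can be reproduced with plain lists. This is a minimal sketch using the example contents from this answer. Lists are used instead of file objects on purpose, because a real file iterator would also be exhausted after the first inner pass. The fix shown (a `found` flag with a single write after the inner loop) is one possible reading of the "helper variable" idea, not necessarily the exact code the author had in mind.

```python
file1_lines = ["line1\n", "line2\n"]
file2_lines = ["line1\n", "line3\n", "line4\n"]

# The question's logic: record line1 once for every non-matching line2
buggy_output = []
for line1 in file1_lines:
    for line2 in file2_lines:
        if line2 != line1:
            buggy_output.append(line1)

# One possible fix: remember whether a match was seen, then write at most once
fixed_output = []
for line1 in file1_lines:
    found = False
    for line2 in file2_lines:
        if line2 == line1:
            found = True
            break  # no need to look at the remaining lines
    if not found:
        fixed_output.append(line1)

print(buggy_output)
print(fixed_output)
```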

Unknown · Answer 3 · 2015-07-30T12:01:35.003

It is because of your for loops.

If I understand well, you want to see what lines in file1 are not present in file2.

So for each line in file1, you have to check whether the same line appears in file2. But that is not what your code does: for each line in file1, you check every line in file2 (this is right), but each time the line in file2 differs from the line in file1, you write the line from file1! You should write the line from file1 only AFTER having checked ALL the lines in file2, to be sure it does not appear even once.

It could look something like this:

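file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')

# Read new2.txt into a list once: testing `line1 not in file2` on the file
# object itself consumes it, so later checks would see only the remaining lines
lines2 = file2.readlines()

for line1 in file1:
    # Write line1 only if it appears nowhere in new2.txt
    if line1 not in lines2:
        NewFile.write(line1)

file1.close()
file2.close()
NewFile.close()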

score 1 · Answer 4 · edited May 23 '17 at 12:09

If your file is a big one, you could use this `for`-`else` approach:

The `else` clause attached to the second `for` loop executes only when that loop completes without hitting `break`, that is, when there is no match.

Modification:

with open('new1.txt') as file1, open('diff.txt', 'w') as NewFile:
    for line1 in file1:
        with open('new2.txt') as file2:
            for line2 in file2:
                if line2 == line1:
                    break
            else:
                NewFile.write(line1)

For more on `for`-`else`, see the Stack Overflow question on for-else.
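
The `for`-`else` behaviour can be seen in isolation with a small helper. The function name `is_missing` is hypothetical and only serves this illustration.

```python
def is_missing(target, candidates):
    """Return True if target equals none of the candidates (hypothetical helper)."""
    for candidate in candidates:
        if candidate == target:
            break  # match found: the else clause below is skipped
    else:
        # Reached only when the loop ran to completion without break
        return True
    return False


print(is_missing("line2\n", ["line1\n", "line3\n"]))
print(is_missing("line1\n", ["line1\n", "line3\n"]))
```

An empty `candidates` sequence also ends up in the `else` clause, since the loop finishes without ever hitting `break`.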

answered Jul 30 '15 at 10:57

The6thSense

This won't work properly, as once you've read all the lines from `file2`, the iterator will be exhausted and you won't get any more lines to compare with the next line from `file1`. You'd need to make a new `with` statement inside the outer loop if you want to read `new2.txt` over and over. – Blckknght Jul 30 '15 at 11:46

Dyrborg · Answer 5 · 2015-07-30T13:21:54.543

I always find that working with sets makes comparing two collections easier, especially because "does this collection contain this" operations run in O(1) on average, and most nested loops can be reduced to a single loop (easier to read, in my opinion).

with open('test1.txt') as file1, open('test2.txt') as file2, open('diff.txt', 'w') as diff:
    s1 = set(file1)
    s2 = set(file2)
    for e in s1:
        if e not in s2:
            diff.write(e)

edited Jul 30 '15 at 13:21

answered Jul 30 '15 at 11:05

Dyrborg

There is no need to read line by line and then add it to a set; you could do this: `s1=set(file1.readlines())`, and likewise for `s2` – The6thSense Jul 30 '15 at 11:06
Though it is fewer lines of code, it is also an extra iteration. No need for that. – Dyrborg Jul 30 '15 at 11:07
Actually, your method would be slower than this one, because you would create an empty set and then append to it as you go. You should measure the time spent appending – The6thSense Jul 30 '15 at 11:13
Due to adding one element at a time – The6thSense Jul 30 '15 at 11:15
I am not appending, I am adding. It's a set, so the elements are added in O(1) time. When you create a set from a list, it doesn't magically convert the list to a set; it has to add each element from the list to the set by hashing each value. – Dyrborg Jul 30 '15 at 11:17
Besides this, you also have to consider the large number of mallocs you force Python to call by first creating a list and THEN inserting its items into a set. My method simply allocates the set and makes better use of the disk buffer, as data is read as you go and not in one massive chunk. See http://stupidpythonideas.blogspot.dk/2013/06/readlines-considered-silly.html – Dyrborg Jul 30 '15 at 11:27
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/84677/discussion-between-vignesh-kalai-and-dyrborg). – The6thSense Jul 30 '15 at 11:28
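
The point debated in the comments above can be checked with a small sketch: a file object is itself an iterable of lines, so `set()` can consume it directly, while `readlines()` builds an intermediate list first. Both produce the same set. The `io.StringIO` objects are hypothetical stand-ins for real files, and the sketch makes no claim about which variant is faster.

```python
import io

text = "one\ntwo\ntwo\nthree\n"

# set() iterates the file object line by line, with no intermediate list
direct = set(io.StringIO(text))

# readlines() first builds a full list, which set() then iterates again
via_list = set(io.StringIO(text).readlines())

print(direct == via_list)
```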

score 0 · Answer 6 · answered Jul 30 '15 at 10:45

Your loop is executed multiple times. To avoid that, use this:

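try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')
# Compare the files position by position (line 1 with line 1, and so on)
for line1, line2 in izip(file1, file2):
    if line2 != line1:
        NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()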

Suresh Subedi

What if `file2` is simply missing the first line, but all rest of the files are equal? – Markus Meskanen Jul 30 '15 at 10:48
You are right. This example doesn't handle that case. But it would require lots of effort to diff two files correctly. You have to account for multiple lines with same content, line order etc. – Suresh Subedi Jul 30 '15 at 10:52
what is the functionality of izip? – Maverick Jul 30 '15 at 10:58
https://docs.python.org/3.5/library/functions.html#zip Use `zip` if you are using Python 3. It goes through multiple files/lists at the same time. – Suresh Subedi Jul 30 '15 at 11:01
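
The limitation raised in the comments above (a file that is merely missing its first line) can be shown with two small lists, which are hypothetical stand-ins for the file contents. The positional `zip` comparison reports lines that are in fact present in the second list, while a membership check does not.

```python
lines1 = ["a\n", "b\n", "c\n"]
lines2 = ["b\n", "c\n"]  # same content as lines1, minus the first line

# Positional comparison: pairs ("a", "b") and ("b", "c"), then zip stops
positional = [l1 for l1, l2 in zip(lines1, lines2) if l1 != l2]

# Membership comparison: is each line of lines1 anywhere in lines2?
lookup = set(lines2)
by_membership = [l1 for l1 in lines1 if l1 not in lookup]

print(positional)
print(by_membership)
```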

score 0 · Answer 7 · answered Jul 30 '15 at 10:47

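Write to NewFile only after comparing with all lines of file2:

with open("new1.txt") as file1, open("difference.txt", "w") as NewFile:
    for line1 in file1:
        present = False
        # Reopen new2.txt for every line1: a file iterator is exhausted after one pass
        with open("new2.txt") as file2:
            for line2 in file2:
                if line2 == line1:
                    present = True
                    break
        # Write only after line1 has been compared with the lines of file2
        if not present:
            NewFile.write(line1)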

kampta

See the [`for ... else`](https://docs.python.org/2/reference/compound_stmts.html#for) construct for a built-in alternative. – 301_Moved_Permanently Jul 30 '15 at 11:04

score 0 · Answer 8 · answered Jul 30 '15 at 12:16

You can use basic set operations for this:

with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    diffs.writelines(set(f1).difference(f2))

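According to this reference, the difference itself will execute in O(n), where n is the number of lines in the first file. If you know that the second file is significantly smaller than the first, you can optimise with set.difference_update(), which has complexity O(n) where n is the number of lines in the second file. Note that building the set from the first file is itself linear in that file's size, and that the output order is not preserved in either version. For example: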

with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    s = set(f1)
    s.difference_update(f2)
    diffs.writelines(s)
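
Both variants can be tried on in-memory data. This is a minimal sketch in which the `io.StringIO` objects are hypothetical stand-ins for the opened files. It also shows that `difference()` accepts any iterable of lines, and that sorting is one way to get deterministic output from an unordered set.

```python
import io

# Hypothetical in-memory stand-ins for new1.txt and new2.txt
f1 = io.StringIO("a\nb\nc\n")
f2 = io.StringIO("b\nx\n")

# difference() accepts any iterable, so f2 does not need to become a set
result = set(f1).difference(f2)

# Sets are unordered; sort if deterministic output matters
ordered = sorted(result)

# Rewind both "files" and repeat with the in-place variant
f1.seek(0)
f2.seek(0)
s = set(f1)
s.difference_update(f2)

print(ordered)
```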
