2

I have written a program to compare the file new1.txt with new2.txt; the lines that are in new1.txt but not in new2.txt have to be written to difference.txt.

Can someone please have a look and let me know what changes are required in the code below? It writes the same line multiple times.

file1 = open("new1.txt",'r')        
file2 = open("new2.txt",'r')    
NewFile = open("difference.txt",'w')   
for line1 in file1:    
    for line2 in file2:    
        if line2 != line1:    
            NewFile.write(line1)    
file1.close()    
file2.close()
NewFile.close()
The6thSense
Maverick

8 Answers

3

Here's an example using the with statement, assuming the files are small enough to fit in memory:

# Open 'new1.txt' as f1, 'new2.txt' as f2 and 'diff.txt' as outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:

    # Read the lines from 'new2.txt' and store them into a python set
    lines = set(f2.readlines())

    # Loop through each line in 'new1.txt'
    for line in f1:

        # If the line was not in 'new2.txt'
        if line not in lines:

            # Write the line to the output file
            outf.write(line)

The with statement simply closes the opened file(s) automatically, even if an exception is raised inside the block. These two pieces of code are roughly equivalent:

with open('temp.log', 'w') as temp:
    temp.write('Temporary logging.')

# roughly equivalent to:

temp = open('temp.log', 'w')
temp.write('Temporary logging.')
temp.close()
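
As a rough sketch of what `with` does behind the scenes: the block behaves more like a try/finally than a plain open/close pair, so the file is closed even if the body raises. The file name `temp.log` comes from the example above; the temporary directory and the variable names `path`, `temp2`, `closed_after_with` and `closed_after_finally` are assumptions made only so the sketch can run anywhere.

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'temp.log')

# Version 1: the with statement closes the file when the block ends.
with open(path, 'w') as temp:
    temp.write('Temporary logging.')
closed_after_with = temp.closed

# Version 2: roughly what the with statement does for a file object.
temp2 = open(path, 'w')
try:
    temp2.write('Temporary logging.')
finally:
    # Runs whether or not write() raised an exception.
    temp2.close()
closed_after_finally = temp2.closed
```

In both versions the file object reports itself as closed once the block has finished.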

Yet another way uses two sets, but again this isn't very memory efficient. If your files are big, this won't work:

# Again, open the three files as f1, f2 and outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:

    # Read the lines in 'new1.txt' and 'new2.txt'
    s1, s2 = set(f1.readlines()), set(f2.readlines())

    # `s1 - s2 | s2 - s1` returns the lines that appear in only one of the two sets
    # Now we simply loop through the different lines
    for line in s1 - s2 | s2 - s1:

        # And output all the different lines
        outf.write(line)

Keep in mind that this last snippet might not preserve the order of your lines.
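
If the order matters, one possible variation (a sketch, not part of the code above) keeps each file's lines in a list and uses the sets only for membership tests. The helper name `ordered_difference` and the sample lines are made up for illustration; in practice the lists would come from `f1.readlines()` and `f2.readlines()`.

```python
def ordered_difference(lines1, lines2):
    """Return lines found in only one of the two lists, in original order."""
    s1, s2 = set(lines1), set(lines2)
    # Lines unique to the first list, in the order they appear there
    only_in_1 = [line for line in lines1 if line not in s2]
    # Lines unique to the second list, in the order they appear there
    only_in_2 = [line for line in lines2 if line not in s1]
    return only_in_1 + only_in_2


# Made-up sample data standing in for the contents of the two files
lines1 = ['a\n', 'b\n', 'c\n']
lines2 = ['c\n', 'd\n', 'a\n']
result = ordered_difference(lines1, lines2)
```

The result holds the same lines as `s1 - s2 | s2 - s1`, but as a list whose order follows the input files.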

Markus Meskanen
  • @VigneshKalai Which is why it says "supposing the files are not too big to fit in the memory" in my answer. If his file2 is too big, this is not the correct answer for him. – Markus Meskanen Jul 30 '15 at 10:48
  • Sorry did not see that :P – The6thSense Jul 30 '15 at 10:49
  • @Markus, the code u gave me worked for me...but have few doubts... when we would need to use 'with; .... and what exactly the line of ur code does? – Maverick Jul 30 '15 at 10:55
  • @Maverick Look at the second code in my answer, that's where I explain what `with` does. It simply closes the file automatically, so personally I'd use `with` every time I open a file. You can still use the old `file = open('bla'); file.close()` method, but I think `with` is easier and it makes sure you never forget to close the file. – Markus Meskanen Jul 30 '15 at 10:57
  • @markus,which means if 'with' is use we dont need to add line to close the file? – Maverick Jul 30 '15 at 11:02
  • @Maverick Yes, exactly. – Markus Meskanen Jul 30 '15 at 11:05
  • 2
    @Maverick In addition, the `with` statement will close your file even in the case of an un-handled exception. – 301_Moved_Permanently Jul 30 '15 at 11:06
  • @MarkusMeskanen ,still dont understand the line lines = set(f2.readlines())..it store the whole lines in the "lines" ? – Maverick Jul 30 '15 at 11:20
  • f2.readlines() reads all the lines from f2 and store them in a list. A set is then created from this list. – Dyrborg Jul 30 '15 at 11:22
  • @Maverick `f2.readlines()` simply reads all the lines in `f2` and returns them as a list. Now we convert the returned list to a `set` simply because sets are faster for comparison than lists. Finally we store the set into `lines` variable. So the variable `lines` now contains all the lines of your file `'new2.txt'`. – Markus Meskanen Jul 30 '15 at 11:22
  • @markus,in the above given code,can we give lines = f.readlines() instead of giving as lines = set(f.readlines() ?...i tested..both are giving the same result.... – Maverick Jul 30 '15 at 18:53
  • @Maverick Yes, of course! :) It's simply faster to check if an element is in a set than it is to check if the element is in a list. But lists work just fine if you're not worried about performance :) but in a program like this, the difference is just milliseconds, you wont notice any. – Markus Meskanen Jul 30 '15 at 19:42
  • thnx again Markus for the explanation – Maverick Jul 31 '15 at 05:31
1

For example, suppose file1 contains: line1 line2

and file2 contains: line1 line3 line4

When you compare line1 with line3, they differ, so you write line1 to your output file. Then you compare line1 with line4; again they are not equal, so you write line1 to the output file again... You need to break out of the inner for loop as soon as a match is found, and write the line only if the inner loop finished without one. You can use a helper variable to record whether a match was found.
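
To make the duplication concrete, here is a small sketch that replays the question's logic on in-memory lists. The sample lines are the ones from this example; the function names `naive_difference` and `flagged_difference` are hypothetical. One caveat: lists can be iterated repeatedly, whereas a file object is exhausted after the first pass, so the real script would behave slightly differently after the first line of new1.txt.

```python
def naive_difference(lines1, lines2):
    """The question's logic: record line1 on every mismatch."""
    out = []
    for line1 in lines1:
        for line2 in lines2:
            if line2 != line1:
                out.append(line1)
    return out


def flagged_difference(lines1, lines2):
    """Record line1 once, and only if no line in lines2 matches it."""
    out = []
    for line1 in lines1:
        found = False
        for line2 in lines2:
            if line2 == line1:
                found = True
                break  # no need to look at the remaining lines
        if not found:
            out.append(line1)
    return out


file1_lines = ['line1', 'line2']
file2_lines = ['line1', 'line3', 'line4']
naive = naive_difference(file1_lines, file2_lines)
fixed = flagged_difference(file1_lines, file2_lines)
```

With this data the naive version records line1 twice and line2 three times, while the flagged version records only line2.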

Raiper34
1

It is because of your for loops.

If I understand correctly, you want to see which lines in file1 are not present in file2.

So for each line in file1, you have to check whether the same line appears in file2. But this is not what your code does: for each line in file1, you check every line in file2 (this is right), but each time the line in file2 differs from the line in file1, you write the line from file1! You should write the line from file1 only AFTER having checked ALL the lines in file2, to be sure it does not appear even once.

It could look something like this:

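file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')

# Read new2.txt once; testing `line1 not in file2` directly would
# exhaust the file object after the first lookup.
lines2 = set(file2)

for line1 in file1:
    if line1 not in lines2:
        NewFile.write(line1)

file1.close()
file2.close()
NewFile.close()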
Unknown
1

If your file is big, you could use the for-else construct:

The else clause attached to the inner for loop executes only when that loop completes without hitting break, that is, when no match was found.

Modification:

with open('new1.txt') as file1, open('diff.txt', 'w') as NewFile:
    for line1 in file1:
        # Reopen new2.txt for every line so it is scanned from the start
        with open('new2.txt') as file2:
            for line2 in file2:
                if line2 == line1:
                    break
            else:
                NewFile.write(line1)

For more on the for-else construct, see the Stack Overflow question on for-else.
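
As a minimal illustration of for-else on its own (the function name `contains` and the sample lists are invented for this sketch): the `else` branch runs only when the loop finishes without hitting `break`, including when the loop body never runs at all.

```python
def contains(items, target):
    """Return True if target is in items, using for-else instead of a flag."""
    for item in items:
        if item == target:
            found = True
            break  # skips the else branch below
    else:
        # Reached only if the loop ended without break
        found = False
    return found


hit = contains(['a', 'b'], 'b')
miss = contains(['a', 'b'], 'z')
empty = contains([], 'a')
```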

Community
The6thSense
  • This won't work properly, as once you've read all the lines from `file2`, the iterator will be exhausted and you won't get any more lines to compare with the next line from `file1`. You'd need to make a new `with` statement inside the outer loop if you want to read `new2.txt` over and over. – Blckknght Jul 30 '15 at 11:46
1

I always find that working with sets makes comparing two collections easier, especially because "does this collection contain this" operations run in O(1) on average, and most nested loops can be reduced to a single loop (easier to read, in my opinion).

with open('test1.txt') as file1, open('test2.txt') as file2, open('diff.txt', 'w') as diff:
    s1 = set(file1)
    s2 = set(file2)
    for e in s1:
        if e not in s2:
            diff.write(e)
Dyrborg
  • there is no need to read line by line an then add it to set you could do this `s1=set(file1.readlines())` like wise for `s2` – The6thSense Jul 30 '15 at 11:06
  • though it is fewer lines of code, it is also an extra iteration. No need for that. – Dyrborg Jul 30 '15 at 11:07
  • Actually your method would be slower compared to this because you would create an empty set then append to it as you go you should calculate the time to append – The6thSense Jul 30 '15 at 11:13
  • Due to adding one element at a time – The6thSense Jul 30 '15 at 11:15
  • I am not appending I am adding. Its a set, so the elements are added in O(1) time. When you create a set from a list, it doesn't magically convert the list to a set, it has to add each element from the list to the set by hashing each value. – Dyrborg Jul 30 '15 at 11:17
  • Besides this you also have to consider the large number of mallocs you force Python to call by first creating a list, and THEN insert them into a set. The method simply just allocates the set, and makes better use of the disc buffer as data is read on to go, and not in a massive chunk. See http://stupidpythonideas.blogspot.dk/2013/06/readlines-considered-silly.html – Dyrborg Jul 30 '15 at 11:27
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/84677/discussion-between-vignesh-kalai-and-dyrborg). – The6thSense Jul 30 '15 at 11:28
0

Your loop is executed multiple times. To avoid that, use this:

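try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')
# izip pairs the files line by line, so lines are compared by position
for line1, line2 in izip(file1, file2):
    if line2 != line1:
        NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()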
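
Note that `izip` (plain `zip` in Python 3) pairs lines by position, so this compares line N of one file with line N of the other rather than checking whether a line appears anywhere in the second file. A small sketch with made-up sample lines shows how the two notions of "different" can disagree:

```python
# Made-up sample data: the same three lines, in a different order
lines1 = ['a\n', 'b\n', 'c\n']
lines2 = ['b\n', 'a\n', 'c\n']

# Positional comparison, as zip/izip does it
positional = [l1 for l1, l2 in zip(lines1, lines2) if l1 != l2]

# Membership comparison: is the line present anywhere in lines2?
seen = set(lines2)
membership = [l1 for l1 in lines1 if l1 not in seen]
```

Here the positional comparison reports the first two lines as different, while the membership comparison reports nothing, because every line of `lines1` exists somewhere in `lines2`. `zip` also stops at the end of the shorter input, so extra lines in the longer file are never compared.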
Suresh Subedi
0

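Write to NewFile only after comparing the line with all lines of file2:

with open("new1.txt") as file1, open("new2.txt") as file2, open("difference.txt", "w") as NewFile:
    for line1 in file1:
        file2.seek(0)  # rewind new2.txt before each scan
        present = False
        for line2 in file2:
            if line2 == line1:
                present = True
                break
        if not present:
            NewFile.write(line1)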
kampta
0

You can use basic set operations for this:

with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    diffs.writelines(set(f1).difference(f2))

According to this reference, this will execute with O(n) where n is the number of lines in the first file. If you know that the second file is significantly smaller than the first you can optimise with set.difference_update(). This has complexity O(n) where n is the number of lines in the second file. For example:

with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    s = set(f1)
    s.difference_update(f2)
    diffs.writelines(s)
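
A quick way to check that both variants select the same lines is to run them on in-memory files. This is only a sketch: `io.StringIO` stands in for the real files, and the sample contents are invented.

```python
import io

# In-memory stand-ins for new1.txt and new2.txt (invented contents)
f1 = io.StringIO(u'a\nb\nc\n')
f2 = io.StringIO(u'b\n')

# Variant 1: build a new set holding the lines of f1 that are not in f2
diff = set(f1).difference(f2)

# Rewind both "files" before reading them a second time
f1.seek(0)
f2.seek(0)

# Variant 2: remove the lines of f2 from the set in place
s = set(f1)
s.difference_update(f2)
```

Both `diff` and `s` end up containing the lines of the first file that are absent from the second. Since sets are unordered, the order in which `writelines` emits them is not guaranteed.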
mhawke