
I have n text files with different names in a folder. I want to compare the text in these files with each other, and if any of them have the same content, move those files to a separate folder and delete them from the main folder. Can anyone help me?

My code so far:

file1=open("F1.txt","r")
file2=open("F2.txt","r")
file3=open("F3.txt","r")
file4=open("F4.txt","r")
file5=open("F5.txt","r")
list1=file1.readlines()
list2=file2.readlines()
list3=file3.readlines()
list4=file4.readlines()
list5=file5.readlines()
for line1 in list1:
    for line2 in list2:
        for line3 in list3:
            for line3 in list4:
                for line4 in list5:
                    if line1.strip() in line2.strip() in line3.strip() in line4.strip() in line5.strip():
                        print line1
                        file3.write(line1)
Manuj
  • You should post the code you have tried, and yes, this is possible in Python. – Nikhil Parmar Jan 13 '16 at 08:55
  • You could calculate a hash of each file and compare just the hash values (a minimal sketch follows after these comments). You might want to show us what effort you have spent on solving your problem. – kotlet schabowy Jan 13 '16 at 08:56
  • I tried the code shown above in the question. – Manuj Jan 14 '16 at 09:11
  • The above code is not solving my purpose. – Manuj Jan 14 '16 at 09:21
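
A minimal sketch of the hashing idea from the comment above (the file_hash helper and the filenames are placeholders, not part of the original discussion):

import hashlib

def file_hash(path):
    # Hash the file contents (not the name), so files with identical content get the same digest
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# 'F1.txt' and 'F2.txt' are placeholder names
if file_hash('F1.txt') == file_hash('F2.txt'):
    print('F1.txt and F2.txt have identical content')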

1 Answer


See the related question "see if two files have the same content in python".

For comparing the files, you can use the filecmp module (http://docs.python.org/library/filecmp.html). Note that filecmp.cmp() defaults to a shallow check based on the os.stat() signature (size and modification time); pass shallow=False to force a byte-by-byte comparison of the contents:

>>> import filecmp
>>> filecmp.cmp('F1.txt', 'F2.txt')
True
>>> filecmp.cmp('F1.txt', 'F3.txt')
False

So one way to tackle it would be (not at all elegant but it does work):

import filecmp
files = ['F1.txt', 'F2.txt', 'F3.txt', 'F4.txt', 'F5.txt']
comparisons = {}
for itm in range(len(files)):
    # Compare each file with the next four in the list; the IndexError raised
    # at the end of the list is simply ignored.
    try:
        res = filecmp.cmp(files[itm], files[itm+1])
        comparisons[str(files[itm]) + ' vs ' + str(files[itm+1])] = res
    except IndexError:
        pass
    try:
        res = filecmp.cmp(files[itm], files[itm+2])
        comparisons[str(files[itm]) + ' vs ' + str(files[itm+2])] = res
    except IndexError:
        pass
    try:
        res = filecmp.cmp(files[itm], files[itm+3])
        comparisons[str(files[itm]) + ' vs ' + str(files[itm+3])] = res
    except IndexError:
        pass
    try:
        res = filecmp.cmp(files[itm], files[itm+4])
        comparisons[str(files[itm]) + ' vs ' + str(files[itm+4])] = res
    except IndexError:
        pass
print(comparisons)

Gives:

{'F1.txt vs F2.txt': True, 'F1.txt vs F5.txt': False, 'F2.txt vs F4.txt': True, 
 'F3.txt vs F4.txt': False, 'F1.txt vs F4.txt': True, 'F2.txt vs F3.txt': False, 
 'F2.txt vs F5.txt': False, 'F1.txt vs F3.txt': False, 'F3.txt vs F5.txt': False, 
 'F4.txt vs F5.txt': False}

As for the other part of your question, you can use the built-in shutil and os modules like so:

import shutil
import os
if filecmp.cmp('F1.txt', 'F2.txt'):
    shutil.move(os.path.abspath('F1.txt'), 'C:\\example\\path')
    shutil.move(os.path.abspath('F2.txt'), 'C:\\example\\path')

UPDATE: a better approach, modified from @zalew's answer: https://stackoverflow.com/a/748879/5247482

import shutil
import os
import hashlib

def remove_duplicates(dir):
    unique = []
    for filename in os.listdir(dir):
        path = os.path.join(dir, filename)
        if os.path.isfile(path):
            print('--Checking ' + path)
            # hash the file contents (not the filename), so identical files get the same hash
            with open(path, 'rb') as f:
                filehash = hashlib.md5(f.read()).hexdigest()
            print(filename, ' has hash: ', filehash)
            if filehash not in unique:
                unique.append(filehash)
            else:
                shutil.move(path, 'C:\\example\\path\\destinationfolder')
    return

remove_duplicates('C:\\example\\path\\sourcefolder')
  • Will it work for multiple files like 100 or 200? @Jon – Manuj Jan 14 '16 at 19:05
  • Yes, but with the approach above you would have 100 or 200 separate try/except statements... better to use a nested for loop over all pairs (see the sketch after these comments). – Jan 14 '16 at 19:22
  • I would recommend this answer http://stackoverflow.com/a/748879/5247482 from @zalew ... then you would just have another line `remove_duplicates('C:\\example\\path')` ... and edit the last line per my answer above `shutil.move(...)` – Jan 14 '16 at 19:23
  • The files are not identical, so I have to group similar files... will this work in that case also? @http://stackoverflow.com/users/5247482/jon – Manuj Jan 15 '16 at 07:16
  • Not sure what you mean by that.. This will compare all files in a folder with all the other files in that folder and move the duplicates to a location you specify. –  Jan 15 '16 at 17:30
  • Oh I see ... you want to know if the files are similar (rather than exactly identical)? –  Jan 15 '16 at 17:44
  • Actually my files are not identical... some words/lines in one file will match lines in another file. – Manuj Jan 18 '16 at 06:05
  • The above will only work for exactly identical files. If they are merely similar, it will depend on how similar... is it one word matching, or 50% of the words matching? Either way, the answers on your other question http://stackoverflow.com/q/34806231/5247482 will be a better solution for you – Jan 18 '16 at 13:25
  • Can we use a vector space model for this? – Manuj Jan 19 '16 at 05:40
  • Not familiar with that, but looks like it: http://stackoverflow.com/a/8716778/5247482 –  Jan 19 '16 at 13:20
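
For reference, a minimal sketch of the nested-loop pairwise comparison suggested in the comments above; move_identical_files, folder, and destination are hypothetical names, not from the original answer:

import filecmp
import itertools
import os
import shutil

def move_identical_files(folder, destination):
    # Compare every pair of files once; move the second file of each identical pair
    files = [f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f))]
    moved = set()
    for a, b in itertools.combinations(files, 2):
        if a in moved or b in moved:
            continue
        # shallow=False forces a byte-by-byte comparison of the contents
        if filecmp.cmp(os.path.join(folder, a), os.path.join(folder, b), shallow=False):
            shutil.move(os.path.join(folder, b), os.path.join(destination, b))
            moved.add(b)

# example call with placeholder paths
move_identical_files('C:\\example\\path\\sourcefolder', 'C:\\example\\path\\destinationfolder')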