I am trying to find an efficient way to compare the content of several text files and find the duplicate lines in them.
I started with a nested loop first, and it worked:
import os

def process_files(self, directory):
    files = os.listdir(directory)
    files = [os.path.join(directory, file) for file in files]
    for i in range(len(files)):
        file1 = files[i]
        # read each file once per outer iteration and split it into words
        with open(file1, 'r') as fh1:
            file1_words = fh1.read().split()
        # start the inner loop at i + 1 so each pair is compared only once
        # and a file is never compared against itself
        for j in range(i + 1, len(files)):
            file2 = files[j]
            with open(file2, 'r') as fh2:
                file2_words = fh2.read().split()
            for w in file2_words:
                if w in file1_words:
                    print(w)
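As an aside, the membership test w in file1_words scans a list, so converting the word lists to sets already speeds up the comparison a lot. A minimal sketch of that change (find_common_words is just a placeholder name):

import os

def find_common_words(file1, file2):
    # sets make membership tests O(1) and allow a direct intersection
    with open(file1, 'r') as fh1:
        words1 = set(fh1.read().split())
    with open(file2, 'r') as fh2:
        words2 = set(fh2.read().split())
    for w in words1 & words2:
        print(w)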
Then I found it very slow, as the files are large. So I tried to use pool workers to find a way around that. I tried to implement the idea mentioned here, but I can't get it to work properly.
I have one requirement: I don't want to compare a file against itself, which should be taken into account when zipping up the file pairs.
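To illustrate what I am aiming for, here is a minimal sketch of the parallel version, assuming itertools.combinations to build the unique file pairs (it never pairs a file with itself, which covers the requirement above) and multiprocessing.Pool to spread the pairs over worker processes; compare_pair and the 'data' directory are placeholders:

import os
from itertools import combinations
from multiprocessing import Pool

def compare_pair(pair):
    # compare_pair is a hypothetical helper: it returns the words
    # shared by the two files in the pair
    file1, file2 = pair
    with open(file1, 'r') as fh1:
        words1 = set(fh1.read().split())
    with open(file2, 'r') as fh2:
        words2 = set(fh2.read().split())
    return file1, file2, words1 & words2

def process_files(directory):
    files = [os.path.join(directory, f) for f in os.listdir(directory)]
    # combinations() yields each unordered pair exactly once and never
    # pairs a file with itself, so no self-comparison can occur
    pairs = list(combinations(files, 2))
    with Pool() as pool:
        for file1, file2, common in pool.map(compare_pair, pairs):
            print(file1, file2, sorted(common))

if __name__ == '__main__':
    process_files('data')   # 'data' is a placeholder directory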
If someone can give me some ideas on this matter, it would be much appreciated. Thanks.