Hi, I am learning Python and, out of curiosity, I have written a program to remove the extra words in a file. I am comparing the text in 'text1.txt' and 'text2.txt' and, based on the text in text1, I am removing the words that are extra in text2.
#!/usr/bin/env python
text1 = open('text1.txt','r')
text2 = open('text2.txt','r')
t_l1 = text1.readlines()
t_l2 = text2.readlines()
# Print the words to check that the file contents were read properly.
print('Printing the file 1 contents:')
w_t1 = []
for line in t_l1:
    # extend() keeps the words of every line; split() also drops the newline
    w_t1.extend(line.split())
for word in w_t1:
    print(word)
# Print the words to check that the file contents were read properly.
print('File 2 contents:')
w_t2 = []
for line in t_l2:
    w_t2.extend(line.split())
for word in w_t2:
    print(word)
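print('Comparing and deleting the excess words.')
# All words of file 1 are in the list w_t1 and all words of file 2 are in w_t2.
# Check whether each word in w_t1 is the same as the word at the same position
# in w_t2. If not, move that word from w_t2 to w and stay at the same position.
w = []  # extra words found in file 2
i = 0
while i < len(w_t1) and i < len(w_t2):
    if w_t1[i] == w_t2[i]:
        print(w_t1[i])
        i += 1
    else:
        # pop(i) removes the word at this position; remove() would delete
        # the first equal word in the list, which may be an earlier one.
        w.append(w_t2.pop(i))
# Any words left in w_t2 past the end of w_t1 are extra as well.
w.extend(w_t2[i:])
del w_t2[i:]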
print('The extra words are: ' + str(w) + '\n')
print(w)
print('The original words are: ' + str(w_t2) + '\n')
print('The extra values are: ')
for item in w:
    print(item)
# opening the file out.txt to write the output.
out = open('out.txt', 'w')
out.write(str(w))
# I am closing the files
text1.close()
text2.close()
out.close()
Say the text1.txt file has the words "Happy birthday dear Friend" and text2.txt has the words "Happy claps birthday to you my dear Best Friend".
The program should output the extra words in text2.txt, which are "claps, to, you, my, Best".
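To make the expected result concrete, here is a minimal sketch of the same comparison on in-memory lists instead of files. The function name extra_words and its parameters are hypothetical, made up only for this illustration, and it assumes the words of text1 appear in text2 in the same order. It walks text2 once instead of deleting from the list repeatedly.

```python
def extra_words(reference, candidate):
    """Return the words of `candidate` that do not line up with `reference`.

    Hypothetical helper for illustration. Assumes the words of `reference`
    occur in `candidate` in the same order.
    """
    extras = []
    i = 0  # position of the next word expected from reference
    for word in candidate:
        if i < len(reference) and word == reference[i]:
            i += 1  # matched: move on to the next reference word
        else:
            extras.append(word)  # no match at this position: extra word
    return extras


reference = "Happy birthday dear Friend".split()
candidate = "Happy claps birthday to you my dear Best Friend".split()
print(extra_words(reference, candidate))
# ['claps', 'to', 'you', 'my', 'Best']
```

This is only a sketch of what I expect the output to be, not necessarily the best way to do it for very large files.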
The above program works fine, but what if I have to do this for a file containing millions of words or millions of lines? Checking each and every word doesn't seem to be a good idea. Does Python have any predefined functions for this?
P.S.: Kindly bear with me if this is a wrong question; I am learning Python. Very soon I'll stop asking these.