
Hi, I am learning Python and, out of curiosity, I have written a program to remove the extra words in a file. I am comparing the text in the files 'text1.txt' and 'text2.txt', and based on the text in text1, I am removing the words which are extra in text2.

#!/usr/bin/python
text1 = open('text1.txt','r')
text2 = open('text2.txt','r')

t_l1 = text1.readlines()
t_l2 = text2.readlines()

# printing to check if the file contents were read properly.
print ' Printing the file 1 contents:'
w_t1 = []
for i in range(len(t_l1)):
    w_t1.extend(t_l1[i].split(' '))  # extend, not assign, so all lines are kept
for j in range(len(w_t1)):
    print w_t1[j]
# printing to see if the contents were read properly.
print 'File 2 contents:'
w_t2 = []
for i in range(len(t_l2)):
    w_t2.extend(t_l2[i].split(' '))
for j in range(len(w_t2)):
    print w_t2[j]


print 'comparing and deleting the excess variables.'

w = []  # list to collect the extra words found in w_t2
i = 1
while (i<=len(w_t1)):
    if(w_t1[i-1] == w_t2[i-1]):
        print w_t1[i-1]
        i += 1
# I put all words of file1 in list w_t1 and file2 in list w_t2. Now I am checking if
# each word in w_t1 is the same as the word in the same place of w_t2; if not, I am
# deleting that word from w_t2 and continuing the while loop.
    else:
        w.append(str(w_t2[i-1]))
        del w_t2[i-1]  # delete by index; .remove() would delete the first match
        # do not advance i: re-check the same position after the removal
print 'The extra words are: ' + str(w) + '\n'
print 'The original words are: '+ str(w_t2) +'\n'
print 'The extra values are: '
for item in w:
    print item
# opening the file out.txt to write the output. 
out = open('out.txt', 'w')
out.write(str(w))

# I am closing the files
text1.close()
text2.close()
out.close()

Say text1.txt has the words "Happy birthday dear Friend" and text2.txt has the words "Happy claps birthday to you my dear Best Friend".

The program should give out the extra words in text2.txt, which are "claps, to, you, my, Best".

The above program works fine, but what if I have to do this for a file containing millions of words, or millions of lines? Checking each and every word doesn't seem to be a good idea. Does Python have any predefined functions for that?

P.S.: Kindly bear with me if this is a wrong question; I am learning Python. Very soon I'll stop asking these.

user3543477

1 Answer


This looks like a set problem. First, add your words to a set structure:

textSet1 = set()
with open('text1.txt', 'r') as text1:
    for line in text1:
        for word in line.split():  # split() also strips the trailing '\n'
            textSet1.add(word)

textSet2 = set()
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split():
            textSet2.add(word)

then simply apply the set difference operator:

textSet2.difference(textSet1)  

which gives you this result:

set(['claps', 'to', 'you', 'my', 'Best'])

You can obtain a list from the previous structure in this way:

list(textSet2.difference(textSet1))

['claps', 'to', 'you', 'my', 'Best']
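As a quick aside (my addition, not part of the original answer), the same difference can be written with the `-` operator; a minimal sketch using the example words from the question:

```python
# Build the two word sets from the example sentences in the question.
textSet1 = set("Happy birthday dear Friend".split())
textSet2 = set("Happy claps birthday to you my dear Best Friend".split())

# "-" is shorthand for set.difference().
extra = textSet2 - textSet1
print(sorted(extra))  # -> ['Best', 'claps', 'my', 'to', 'you']
```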

Then, as you can read here, you shouldn't worry about large file sizes, because with the given loader:

When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else

More about lazy file loading here.
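To make the lazy reading explicit, here is a small sketch (my addition) of a generator that streams whitespace-separated words one line at a time, so only the current line is ever held in memory:

```python
import os
import tempfile

def words(path):
    # Yield words lazily; each line can be garbage-collected once the next is read.
    with open(path) as f:
        for line in f:
            for word in line.split():
                yield word

# Demo with a throwaway file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as f:
    f.write("Happy birthday\ndear Friend\n")

result = list(words(path))
print(result)  # -> ['Happy', 'birthday', 'dear', 'Friend']
os.remove(path)
```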

Finally, in a real problem I suppose there is a first set (e.g. bad words) that has a relatively small size, and a second file with a huge amount of data. If this is the case, then you can avoid the creation of the second set:

diff = []
with open('text2.txt', 'r') as text2:
    for line in text2:
        for word in line.split():
            if word not in textSet1:  # keep only the words missing from the first set
                diff.append(word)
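For completeness, here is the streaming idea wrapped in a small function (the names are mine, purely illustrative), exercised with an in-memory stream instead of a real file:

```python
import io

def extra_words(known, stream):
    # Collect, in order, every word of the stream that is absent from `known`.
    diff = []
    for line in stream:
        for word in line.split():
            if word not in known:
                diff.append(word)
    return diff

known = set("Happy birthday dear Friend".split())
stream = io.StringIO("Happy claps birthday to you my dear Best Friend\n")
print(extra_words(known, stream))  # -> ['claps', 'to', 'you', 'my', 'Best']
```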
Salvatore Avanzo