I am trying to remove lines from a text file that contain certain words and their variants (I'm not sure that is the correct word) using Python.
What I mean by variants:
"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"
So, I tried doing it manually using the following code:
infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')
word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]
for line in infile1:
    tempList = line.split()
    if any(el in tempList for el in word_list):
        continue
    else:
        outfile1.write(line)
It didn't work well: some of the words listed in word_list were still present in the output file. There are many more word variants to consider (like God, God!, book, Book, books, books?, etc.).
I was wondering if there is a more efficient way to do this (maybe with regular expressions).
EDIT 1:
Input: Sample.txt:
I want my book.
I need my books.
Why you need a book?
Let's go read.
Coming to library
I need to remove all the lines containing "book.", "books.", or "book?" from my Sample.txt file.
Output: Fixed.txt:
Let's go read.
Coming to library
NOTE: The original corpus has around 60,000 lines.
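For reference, here is a minimal sketch of the kind of regex-based filter I have in mind. It is only a guess at a solution. The base word list, the function names (build_pattern, keep_lines, filter_file), and the UTF-8 encoding are assumptions of mine, and it only covers case differences, surrounding punctuation, a plural "s", and a possessive "'s" or "’s", not every possible variant.

```python
import re

# Assumed base words; variants are derived by the pattern, not listed by hand.
BASE_WORDS = ["yay", "god", "book"]


def build_pattern(words):
    """Compile one case-insensitive pattern covering all base words."""
    alternatives = "|".join(re.escape(word) for word in words)
    # \b keeps "book" from matching inside "bookshelf"; the optional group
    # accepts a plural "s" or a possessive "'s" / "’s". Quotes and trailing
    # punctuation are non-word characters, so \b already tolerates them.
    return re.compile(r"\b(?:" + alternatives + r")(?:['’]?s)?\b", re.IGNORECASE)


def keep_lines(lines, pattern):
    """Yield only the lines that contain none of the words."""
    for line in lines:
        if not pattern.search(line):
            yield line


def filter_file(src_path, dst_path, pattern):
    """Stream src_path to dst_path line by line, dropping matching lines."""
    # UTF-8 is an assumption here; the curly quotes suggest a non-ASCII file.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(keep_lines(src, pattern))


if __name__ == "__main__":
    sample = [
        "I want my book.\n",
        "I need my books.\n",
        "Why you need a book?\n",
        "Let's go read.\n",
        "Coming to library\n",
    ]
    print("".join(keep_lines(sample, build_pattern(BASE_WORDS))), end="")
```

Since the file is read one line at a time and the pattern is compiled once, I would expect this to cope with 60,000 lines, but I have not measured it.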