
I am trying to remove lines from a text file that contain certain words and their variants (I'm afraid "variants" is the correct word) using Python.

What I mean by variants:

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So, I tried doing it manually using the following code:

word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

# Copy file1.txt to file2.txt, skipping lines that contain a listed token.
with open("file1.txt", 'r') as infile1, open("file2.txt", 'w') as outfile1:
    for line in infile1:
        tempList = line.split()
        if any(el in tempList for el in word_list):
            continue
        outfile1.write(line)

It didn't work well: some of the words listed in word_list were still present in the output file. There are many more word variants to consider (like God, God!, book, Book, books, books? etc.).

I was wondering if there is a more efficient way to do this (perhaps with a regular expression).

EDIT 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove all the lines containing "book.", "books.", or "book?" from my Sample.txt file.

Output: Fixed.txt:

Let's go read.

Coming to library

NOTE: The original corpus has around 60,000 lines

CDspace
ObiWan

1 Answer


You can set a flag for every line and keep or drop the line based on the flag value, something like this:

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0  # set to 1 once any word matches this line
    for word in words:
        # Lowercase both sides so case is not a factor in matching.
        if word.lower() in line.lower():
            flag = 1
            break  # no need to check the remaining words for this line
    if flag == 0:
        result.append(line)  # keep the line only when no word matched

print(result)

This results in:

["Let's go read.", 'Coming to library']
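
Since the question asks about regular expressions, the same substring filter can be sketched with the standard `re` module. This is only an illustration, not part of the original answer: the helper names `build_pattern` and `filter_lines` are made up for this example, and it assumes that any line containing one of the words anywhere (including inside longer words such as "bookshelf") should be dropped.

```python
import re

def build_pattern(words):
    """Compile one case-insensitive pattern that matches any of the words as a substring."""
    # re.escape keeps characters such as "?" or "." in a word from acting as regex syntax.
    return re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)

def filter_lines(lines, words):
    """Return the lines that contain none of the words (hypothetical helper)."""
    if not words:
        # An empty alternation would match every line, so keep everything instead.
        return list(lines)
    pattern = build_pattern(words)
    return [line for line in lines if not pattern.search(line)]

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library",
]
print(filter_lines(input_sample, ["book"]))
```

Because `filter_lines` accepts any iterable of strings, an open file object could be passed in place of the list, which should be workable for a corpus of around 60,000 lines.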
Satish Prakash Garg
  • You could make it additionally robust by forcing `input_sample` to lowercase (i.e. `input_sample.lower()`). This would eliminate the number of variations needed in `words`, as seen in OP's "yay" "Yay" example. – Jammeth_Q Apr 04 '17 at 16:58
  • @Jammeth_Q Edited. Thanks. – Satish Prakash Garg Apr 04 '17 at 16:59
  • @SatishGarg It is working for certain words but it fails to capture: `book?, book:, "book", book':, book!'.` There are way too many words like this. I was looking for a generic solution which removes every word which has, say `book` in it, like: `bookshelf`, `book-matching`. etc – ObiWan Apr 04 '17 at 17:10
  • @ssokhey The `input_sample` i used does contain `book?`, etc. and it is filtered. check `result` – Satish Prakash Garg Apr 04 '17 at 17:11
  • @SatishGarg I think it will work. Actually if I use `words = ['book']`, it is not filtering 'Books! Books etc.)' but if I use `words = ['Book']`, it is filtering everything. Why is that? – ObiWan Apr 04 '17 at 17:17
  • @ssokhey I am getting same result with both `['Book']` and `['book']` as `.lower()` is used so case does not matter. Every word and line will be converted to lower-case – Satish Prakash Garg Apr 04 '17 at 17:19
  • @ssokhey Added comments. – Satish Prakash Garg Apr 04 '17 at 17:34
  • Another possibly worthwhile addition would be to remove punctuation from each line: https://stackoverflow.com/questions/16050952/how-to-remove-all-the-punctuation-in-a-string-python – Jammeth_Q Apr 04 '17 at 19:52
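
The punctuation idea from the last comment can be sketched as follows. This is a rough illustration under stated assumptions, not the commenter's code: `PUNCTUATION`, `line_words` and `keep_line` are hypothetical names, and the curly quotes are added by hand because `string.punctuation` only covers ASCII characters. Punctuation is replaced with spaces rather than deleted, so "Yay’s" and "book-matching" split into separate words.

```python
import string

# Assumption: ASCII punctuation plus the curly quotes that appear in the question.
PUNCTUATION = string.punctuation + "“”‘’"
# Map every punctuation character to a space.
_STRIP_TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))

def line_words(line):
    """Lowercase the line, turn punctuation into spaces, and return the set of words."""
    return set(line.lower().translate(_STRIP_TABLE).split())

def keep_line(line, banned):
    """True when the line shares no whole word with the banned set (hypothetical helper)."""
    return line_words(line).isdisjoint(banned)

banned = {"yay", "book", "books"}
sample = ['“Yay”; she said', "Why you need a book?", "Let's go read.", "A bookshelf"]
print([line for line in sample if keep_line(line, banned)])
```

Note the trade-off: this matches whole words only, so "bookshelf" is kept unless it is added to `banned`. If every word containing "book" should be caught, the substring check from the answer is the better fit.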