How to remove set of words and their variants (or inflections) from a text file?

Question

I am trying to remove lines from a text file that contain certain words and their variants (I'm not sure that's the correct word) using Python.

What I mean by variants:

"Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"

So, I tried doing it manually using the following code:

infile1 = open("file1.txt",'r')
outfile1 = open("file2.txt",'w')

word_list = ["Yay","yay",'“Yay','Yay”',"Yay;","Yay?","Yay’s","Yay's",'Yay!','Yay.',"Yay”;"]

for line in infile1:
    tempList = line.split()
    if any((el in tempList for el in word_list)):
        continue
    else:
        outfile1.write(line)

It didn't work well: some of the words listed in word_list were still present in the output file. There are many more word variants to consider (like God, God!, book, Book, books, books? etc).

I was wondering if there is a more efficient way to do this (with regular expressions, maybe).

EDIT 1:

Input: Sample.txt:

I want my book.

I need my books.

Why you need a book?

Let's go read.

Coming to library

I need to remove all the lines containing "book.","books.", "book?" from my sample.txt file.

Output: Fixed.txt:

Let's go read.

Coming to library

NOTE: The original corpus has around 60,000 lines

@ColonelBeauvel Does the above minimal input/output make sense? — ObiWan, Apr 04 '17 at 16:19
@SatishGarg I guess stemmer won't remove something like `book;!.` — ObiWan, Apr 04 '17 at 16:20
@MooingRawr I didn't get your question. Can you elaborate please. — ObiWan, Apr 04 '17 at 16:21
@ssokhey if the sentence had `pre'book'` in the sentence, would that line be removed ? also what about `bookshelf`? — MooingRawr, Apr 04 '17 at 16:24
@MooingRawr In my particular case, even if it removes `bookshelf` or `pre'book`, it is totally fine. — ObiWan, Apr 04 '17 at 16:30
@ssokhey Check the solution and let me know if it works for you. — Satish Prakash Garg, Apr 04 '17 at 16:54

Satish Prakash Garg · Accepted Answer · 2017-04-04T17:33:42.827

You can set a flag for every line and keep or drop the line based on the flag's value, something like this:

input_sample = [
    "I want my book.",
    "I need my books.",
    "Why you need a book?",
    "Let's go read.",
    "Coming to library"
]
words = ['book']
result = []
for line in input_sample:
    flag = 0    # tracks whether a match has been found for this line
    for word in words:
        # lowercase both the word and the line so that case is not a factor in matching
        if word.lower() in line.lower():
            flag = 1    # set to 1 on the first match
            break    # no more words need to be checked, so move on to the next line
    if flag == 0:
        result.append(line)    # keep the line only when no word matched

print(result)

This results in:

["Let's go read.", 'Coming to library']
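
For the file-based setup in the question, the same case-insensitive substring test could also be written with a regular expression, which the question asked about. The following is only a minimal sketch, not the answer's own code: the function names `build_pattern`, `filter_lines` and `filter_file` are hypothetical, and UTF-8 encoding is an assumption about the input file.

```python
import re


def build_pattern(words):
    """Compile one case-insensitive pattern matching any of the words as a substring."""
    if not words:
        raise ValueError("words must not be empty")
    # re.escape keeps characters such as "?" or "." in a word literal.
    return re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)


def filter_lines(lines, pattern):
    """Yield only the lines in which the pattern does not occur."""
    for line in lines:
        if not pattern.search(line):
            yield line


def filter_file(in_path, out_path, words):
    """Copy in_path to out_path line by line, skipping matching lines."""
    pattern = build_pattern(words)
    # Encoding is an assumption; adjust it to match the actual corpus.
    with open(in_path, encoding="utf-8") as infile, \
            open(out_path, "w", encoding="utf-8") as outfile:
        outfile.writelines(filter_lines(infile, pattern))
```

With the file names from the question, a call would look like `filter_file("Sample.txt", "Fixed.txt", ["book"])`. Because the file is read one line at a time, the 60,000-line corpus does not need to be held in memory. As with the loop above, this is a substring match, so lines containing `bookshelf` are removed as well.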

edited Apr 04 '17 at 17:33

answered Apr 04 '17 at 16:45


You could make it additionally robust by forcing `input_sample` to lowercase (i.e. `input_sample.lower()`). This would reduce the number of variations needed in `words`, as seen in OP's "yay" "Yay" example. – Jammeth_Q Apr 04 '17 at 16:58
@Jammeth_Q Edited. Thanks. – Satish Prakash Garg Apr 04 '17 at 16:59
@SatishGarg It is working for certain words but it fails to capture: `book?, book:, "book", book':, book!'.` There are way too many words like this. I was looking for a generic solution which removes every word which has, say `book` in it, like: `bookshelf`, `book-matching`. etc – ObiWan Apr 04 '17 at 17:10
@ssokhey The `input_sample` I used does contain `book?`, etc., and it is filtered. Check `result`. – Satish Prakash Garg Apr 04 '17 at 17:11
@SatishGarg I think it will work. Actually if I use `words = ['book']`, it is not filtering 'Books! Books etc.)' but if I use `words = ['Book']`, it is filtering everything. Why is that? – ObiWan Apr 04 '17 at 17:17
@ssokhey I am getting same result with both `['Book']` and `['book']` as `.lower()` is used so case does not matter. Every word and line will be converted to lower-case – Satish Prakash Garg Apr 04 '17 at 17:19
@ssokhey Added comments. – Satish Prakash Garg Apr 04 '17 at 17:34
Another possibly worthwhile addition would be to remove punctuation from each line: https://stackoverflow.com/questions/16050952/how-to-remove-all-the-punctuation-in-a-string-python – Jammeth_Q Apr 04 '17 at 19:52
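
The punctuation-removal idea from the last comment could be sketched roughly as below, for cases where only whole words should match. This is an illustration under stated assumptions, not code from the thread: the names `PUNCTUATION`, `TO_SPACES`, `tokens` and `contains_word` are hypothetical, and the curly quotes are added by hand because `string.punctuation` covers ASCII only.

```python
import string

# ASCII punctuation plus the curly quotes that appear in the question's examples.
PUNCTUATION = string.punctuation + "“”‘’"
# Map every punctuation character to a space, so "Yay's" splits into "Yay" and "s".
TO_SPACES = str.maketrans({ch: " " for ch in PUNCTUATION})


def tokens(line):
    """Return the lowercase words of a line with punctuation replaced by spaces."""
    return line.translate(TO_SPACES).lower().split()


def contains_word(line, words):
    """Return True if any of the words appears as a whole token in the line."""
    wanted = {w.lower() for w in words}
    return not wanted.isdisjoint(tokens(line))
```

This covers punctuation variants such as `Yay!`, `“Yay` or `Yay’s` with the single entry `yay`, but unlike the substring test it does not match inflections such as `books` or compounds such as `bookshelf` for the entry `book`.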
