I want to remove a line from Myfile.txt only if that line consists of nothing but a stopword.

For example, a sample of Myfile.txt is:

Adh Dhayd
Abu Dhabi is      # here is "is" stopword but this line should not be removed because line contain #Abu Dhabi is
Zaranj
of                # this line contains just stop word, this line should be removed
on                # this line contains just stop word, this line should be removed
Taloqan
Shnan of          # here is "of" stopword but this line should not be removed because line contain #Shnan of
is                # this line contains just stop word, this line should be removed
Shibirghn
Shahrak
from              # this line contains just stop word, this line should be removed

I have this code as an example:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize



example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

So what would the solution code be for Myfile.txt, given the requirements above?

eyllanesc
csit

1 Answer

You could check whether the line matches any of the stopwords and, if it does not, append it to the filtered content. That works if you want to filter out the lines that contain exactly one stopword and nothing else. If a line made up of several stopwords should also be filtered, tokenize the line and intersect the tokens with stop_words. The single-stopword version:

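from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Read all lines of the input file (use your own file name, e.g. Myfile.txt)
with open("test.txt", "r") as f:
    lines = f.read().splitlines()

# Keep a line unless it consists of exactly one stopword;
# strip() ignores leading and trailing whitespace for the comparison
filtered_content = [line for line in lines if line.strip() not in stop_words]

# "w" overwrites earlier output, so re-running does not append duplicates
with open("test_filter.txt", "w") as g:
    g.write("\n".join(filtered_content))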

If you want to remove lines made up of multiple stopwords, use this if-statement instead. It removes a line that contains only stopwords; if at least one word is not a stopword, the line is kept:

if not set(word_tokenize(line)).issubset(stop_words):
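To make the "only stopwords" rule concrete, here is a minimal sketch that runs without NLTK. It makes several assumptions that are not in the answer: the helper names `is_only_stopwords` and `filter_lines` are hypothetical, words are split on whitespace with `str.split()` rather than `word_tokenize`, and the stop word set is a small hand-written one. In practice you would likely pass `set(stopwords.words('english'))` instead.

```python
def is_only_stopwords(line, stop_words):
    """Return True when the line has at least one word and every word is a stop word."""
    words = line.split()  # whitespace split; an assumption, not NLTK tokenization
    return bool(words) and all(word in stop_words for word in words)


def filter_lines(lines, stop_words):
    """Keep every line that is not made up solely of stop words."""
    return [line for line in lines if not is_only_stopwords(line, stop_words)]


# Hand-written stop word set for illustration only.
demo_stop_words = {"is", "of", "on", "from"}

# The sample lines from the question, without the explanatory comments.
sample = [
    "Adh Dhayd", "Abu Dhabi is", "Zaranj", "of", "on", "Taloqan",
    "Shnan of", "is", "Shibirghn", "Shahrak", "from",
]

filtered = filter_lines(sample, demo_stop_words)
print(filtered)
```

Unlike the if-statement above, this sketch keeps blank lines, because a line with no words is not treated as a stopword line. Whether that is what you want depends on your file.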
f.wue
  • Can you help me remove them without regard to case? It should not be case sensitive. – csit Mar 07 '19 at 13:16
  • You can use `line.lower()`. However, try searching for your question first, as the one in your comment has been answered many times and in many tutorials :) (https://stackoverflow.com/questions/6797984/how-to-lowercase-a-string-in-python) – f.wue Mar 07 '19 at 13:19
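As a rough illustration of the `line.lower()` suggestion in the comments, the sketch below lowercases each line only for the comparison, so the kept lines retain their original casing. The helper name `is_stopword_line` and the small stop word set are assumptions made for this example. It relies on the stop word set itself being lowercase, which is the case for NLTK's English list.

```python
def is_stopword_line(line, stop_words):
    """Return True when the whole line, ignoring case and outer whitespace, is one stop word."""
    return line.strip().lower() in stop_words


# Hand-written stop word set for illustration only.
stop_words = {"is", "of", "on", "from"}

lines = ["Of", "IS", "Abu Dhabi is", "From ", "Zaranj"]

# Compare in lowercase, but keep the original text of the surviving lines.
kept = [line for line in lines if not is_stopword_line(line, stop_words)]
print(kept)
```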