How to find the number of common words in a text file and delete them in Python?

Question

The task is to:

First, find the total number of words in a text file.
Second, delete the common words such as a, an, and, to, in, at, but, ... (writing out a list of these words is allowed).
Third, find the number of remaining words (unique words).
Finally, make a list of them.

The file name should be passed as the parameter of the function.

I have done the first part of the question:

import re

file = open('text.txt', 'r', encoding = 'latin-1')

word_list = file.read().split()

for x in word_list:
    print(x)

res = len(word_list)
print ('The number of words in the text:' + str(res))


def uncommonWords (file):
    uncommonwords = (list(file))
    for i in uncommonwords:
        i += 1
        print (i)

The code prints the words and the word count, but nothing appears after that.

well, you define a function but never call it (`uncommonWords`), so that is expected. — Derlin, Sep 17 '19 at 07:42
If you mean I should try the 'return file' at the end, I tried that too but it didn't work — Mahsa, Sep 17 '19 at 07:53

score 0 · Answer 1 · answered Sep 17 '19 at 07:48

You can do it like this:

# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])

# set to collect unique words
words_in_file = set()
with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)

# remove common words from word list
unique_words = words_in_file - stop_words

print(list(unique_words))
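
Since the question asks for a function that takes the file name as its parameter and also reports the word counts, the same set-difference idea could be wrapped up roughly as follows. This is only a sketch: the function name `uncommon_words`, the `STOP_WORDS` list, and the choice to lowercase every word are my assumptions, not something stated in the question.

```python
# Hypothetical stop-word list; extend it with whatever words you consider common.
STOP_WORDS = {"a", "an", "and", "to", "in", "at", "but", "is", "the"}


def uncommon_words(filename, stop_words=STOP_WORDS, encoding="latin-1"):
    """Return (total word count, sorted list of unique non-stop words)."""
    with open(filename, encoding=encoding) as text_file:
        # Lowercasing is an assumption so that "The" and "the" count as one word.
        words = text_file.read().lower().split()
    # Set difference removes the common words and collapses duplicates.
    remaining = set(words) - set(stop_words)
    return len(words), sorted(remaining)
```

Calling `uncommon_words('text.txt')` would then give back the total number of words and the list of remaining unique words, and `len()` of that list is the number asked for in the third step. Note that nothing is printed unless you actually call the function.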

score 0 · Answer 2 · answered Sep 17 '19 at 07:52

First, you may want to get rid of tokens that consist only of punctuation. As shown in this answer, you can do:

 import re
 # text is the list of words, e.g. word_list from the question
 nonPunct = re.compile('.*[A-Za-z0-9].*')
 filtered = [w for w in text if nonPunct.match(w)]
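
To make clear what this filter does and does not do, here is a small sketch on made-up tokens. The pattern keeps every token that contains at least one letter or digit, so it drops stand-alone punctuation but leaves a comma attached to a word in place. The second step, stripping punctuation from the ends of each word with `string.punctuation`, is an extra suggestion of mine and not part of the linked answer.

```python
import re
import string

# Sample tokens are made up for illustration.
text = "Hello , world ! -- it's 2019 .".split()

non_punct = re.compile(r".*[A-Za-z0-9].*")
# Keep only tokens that contain at least one letter or digit.
filtered = [w for w in text if non_punct.match(w)]

# Optional extra step (an assumption): strip punctuation attached to words.
stripped = [w.strip(string.punctuation) for w in "Hello, world! (2019)".split()]

print(filtered)
print(stripped)
```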

Then you could do:

from collections import Counter
counts = Counter(filtered)

You can then access the list of unique words with list(counts.keys()), and you can choose to ignore the words you don't want with:

[word for word in list(counts.keys()) if word not in common_words]
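
Put together, the steps above might look like the sketch below. The sample `text` and the contents of `common_words` are placeholders I made up; in practice `text` would be the list you get from `file.read().split()`.

```python
import re
from collections import Counter

# Illustrative inputs only; replace with the words read from your file.
text = "the cat and the dog ; the cat sat".split()
common_words = {"the", "and", "a", "an", "to", "in", "at", "but"}

# Drop tokens that contain no letter or digit (here the lone ";").
non_punct = re.compile(r".*[A-Za-z0-9].*")
filtered = [w for w in text if non_punct.match(w)]

# Count occurrences; the Counter keys are the unique words.
counts = Counter(filtered)
uncommon = [word for word in counts if word not in common_words]

print(len(filtered))  # number of words after the punctuation filter
print(uncommon)       # unique words that are not in common_words
print(len(uncommon))  # how many such words remain
```

A side benefit of `Counter` over a plain set is that `counts[word]` still tells you how often each remaining word occurred.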

Hope this answers your question.
