The task is to:

  • First, find the total number of words in a text file
  • Second, delete common words such as a, an, and, to, in, at, but, ... (writing out a list of these words is allowed)
  • Third, find the number of remaining (unique) words
  • Make a list of them

The file name should be passed as the parameter of the function.

I have done the first part of the question:

import re

file = open('text.txt', 'r', encoding = 'latin-1')

word_list = file.read().split()

for x in word_list:
    print(x)

res = len(word_list)
print ('The number of words in the text:' + str(res))


def uncommonWords (file):
    uncommonwords = (list(file))
    for i in uncommonwords:
        i += 1
        print (i)

The code runs as far as printing the number of words, and nothing appears after that.

Trenton McKinney
Mahsa

2 Answers

You can do it like this:

# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])

# set to collect unique words
words_in_file = set()
with open("words.txt") as text_file:
    for line in text_file:
        for word in line.split():
            words_in_file.add(word)

# remove common words from word list
unique_words = words_in_file - stop_words

print(list(unique_words))
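
Note that the function in the question is defined but never called, which is why nothing appears after the word count. As a minimal sketch of how the steps above could be wrapped in a function that takes the file name as its parameter, something like the following might work. The name `uncommon_words`, the `STOP_WORDS` list, and the choice to lowercase words and strip surrounding punctuation are assumptions for illustration, not part of the original code.

```python
import string

# Hypothetical stop-word list; extend it with whatever common words you want to drop.
STOP_WORDS = {"a", "an", "and", "to", "in", "at", "but", "is", "the"}


def uncommon_words(filename, stop_words=STOP_WORDS):
    """Return (total word count, sorted list of unique words without stop words)."""
    with open(filename, encoding="latin-1") as text_file:
        # Lowercase and strip surrounding punctuation so that "The," matches "the".
        words = [w.strip(string.punctuation).lower()
                 for w in text_file.read().split()]
    # Drop tokens that consisted only of punctuation.
    words = [w for w in words if w]
    # Set difference removes the common words and keeps each remaining word once.
    unique = sorted(set(words) - set(stop_words))
    return len(words), unique
```

Calling `uncommon_words('text.txt')` then gives back the total word count and the list of remaining words, and `len()` of that list is the number of unique words.
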
Dev Khadka

First, you may want to get rid of punctuation. As shown in this answer, you can do:

 import re
 # `text` is the list of words; keep only tokens containing at least one letter or digit
 nonPunct = re.compile('.*[A-Za-z0-9].*')
 filtered = [w for w in text if nonPunct.match(w)]

Then you could do:

from collections import Counter
counts = Counter(filtered)

You can then get the list of unique words with list(counts.keys()), and you can choose to ignore the words you don't want with:

[word for word in list(counts.keys()) if word not in common_words]
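
Put together, the pieces above might look like the sketch below. The sample `text` and the `common_words` set are made-up inputs for illustration. Note that the regular expression only drops tokens that contain no letter or digit at all; it does not strip punctuation attached to a word.

```python
import re
from collections import Counter

# Assumed inputs for illustration: a list of tokens and a set of words to ignore.
text = "to be , or not to be : that is the question".split()
common_words = {"to", "or", "is", "the"}

# Keep only tokens that contain at least one letter or digit.
non_punct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if non_punct.match(w)]

# Counter maps each distinct word to how often it occurs.
counts = Counter(filtered)

# Iterating over a Counter yields its keys, i.e. the distinct words.
unique_words = [word for word in counts if word not in common_words]

print("Total words:", len(filtered))
print("Unique words:", len(unique_words))
print(unique_words)
```
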

Hope this answers your question.

théol