0

I would like to read each word from a given text file and then compare these words with an existing English dictionary, which may be a system dictionary or any other source. Here is the code I have tried, but it has a problem: it also reads brackets and other unnecessary characters as part of the words.

with open('words.txt') as f:
    M = [word for line in f for word in line.split()]
S = list(set(M))

for i in S:
    print(i)

How can I do the job?

Nullman
MKS
  • Can you show actual input? – Alderven Feb 21 '19 at 12:23
  • I have a feeling your question is bigger than you realize. Look at the `nltk` package, specifically `nltk.tokenize`; [here](https://stackoverflow.com/questions/15547409/how-to-get-rid-of-punctuation-using-nltk-tokenizer) is a similar question – Nullman Feb 21 '19 at 12:25
  • @Alderven Here is the input file https://www.gutenberg.org/files/1342/1342-0.txt – MKS Feb 21 '19 at 12:35
  • By coding more than you did. There is no attempt to compare anything against any dictionary, and no attempt to clean up your split words. There is also no test data to go by. We do not code for you. I suggest reading about [NLTK](https://www.nltk.org/), stemming, filler words and stop words, and language processing more broadly. – Patrick Artner Feb 21 '19 at 12:49

2 Answers

1

The str.strip() method will be useful here. The following code removes parentheses from the start and end of each word:

f = ["sagd  sajdvsja  jsdagjh () shdjkahk sajhdhk (ghj jskldjla) ...."]
# strip("()") removes parentheses from both ends of each word
M = [word.strip("()") for line in f for word in line.split()]
S = list(set(M))

for i in S:
    print(i)
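If other punctuation also has to go, the same idea extends naturally. The following is only a minimal sketch, assuming that stripping every ASCII punctuation character (`string.punctuation`) from the ends of each token is acceptable for your text; the names `lines`, `stripped` and `unique_words` are just illustrative:

```python
import string

lines = ["sagd  sajdvsja  jsdagjh () shdjkahk sajhdhk (ghj jskldjla) ...."]

# string.punctuation holds the ASCII punctuation characters; strip() removes
# them only from the two ends of each token, not from the middle.
stripped = (word.strip(string.punctuation) for line in lines for word in line.split())

# Tokens such as "()" or "...." become empty strings after stripping, so drop them.
unique_words = sorted({word for word in stripped if word})
print(unique_words)
```

Punctuation inside a token (for example the apostrophe in "don't") is left alone, which may or may not be what you want.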
Hoog
  • It is not the complete answer. How do I compare these with a valid English dictionary? – MKS Feb 21 '19 at 12:42
  • Which valid English dictionary? Once you find a list of acceptable words, replace print(i) with a check to see if i is in that list – Hoog Feb 21 '19 at 12:50
  • Suppose the Oxford English Dictionary. – MKS Feb 21 '19 at 12:55
  • Do a Google search for Oxford English Dictionary and find a list of words you like. – Hoog Feb 21 '19 at 12:58
1

You can use regex to filter non-letters:

import re

M = []
with open('words.txt') as f:
    for line in f:
        for word in line.split():
            # keep only the first run of letters in each token
            letters = re.findall('[A-Za-z]+', word)
            if letters:
                M.append(letters[0])

S = list(set(M))

for i in S:
    print(i)

Output:

computer
respect
incautiously
softened
satisfied
child
ideas
devoting
overtaken

etc.
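For the dictionary comparison itself, one possible approach is a set membership test against a word list. This is only a sketch under assumptions: the helper names (`extract_words`, `load_dictionary`, `split_by_dictionary`) are hypothetical, the dictionary is assumed to be a plain text file with one word per line (many Unix systems ship one at /usr/share/dict/words, but that depends on your machine), and a tiny inline set stands in for it here so the example runs anywhere:

```python
import re


def extract_words(text):
    """Return the set of lowercase alphabetic words found in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def load_dictionary(path):
    """Read a word list with one word per line into a lowercase set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def split_by_dictionary(words, dictionary):
    """Partition a set of words into (known, unknown) relative to the dictionary."""
    known = {w for w in words if w in dictionary}
    return known, words - known


# Tiny inline stand-in for a real word list (an assumption for illustration).
sample_dictionary = {"it", "is", "a", "truth", "universally", "acknowledged"}
sample_text = "It is a truth (universally) acknowledged, that a single man..."

known, unknown = split_by_dictionary(extract_words(sample_text), sample_dictionary)
print(sorted(known))
print(sorted(unknown))
```

With a real word list you would call load_dictionary() on its path instead of using sample_dictionary. Lowercasing both sides keeps "It" and "it" from being treated as different words.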

Alderven