I am trying to write a spell-checking function that reads a text file containing a passage with several misspelt words. For example: "My favorite subjects are: Physcs, Maths, Chemistree and Biology - I find it necesary to use my iPad to make comprensive notes after class." I have three issues that I am trying to resolve:
Currently, the program treats Maths as an incorrect word because of the comma immediately after it. I believe the best fix is to split the text into words and punctuation, like so: ['My', 'favorite', 'subjects', 'are', ':', ' ', 'Physcs', ' ', 'Maths', ','...etc]. How do I split the string into words and punctuation without using any imported Python modules (e.g. string or re)?
I am currently checking each word in the text file against a list of accepted English words, one word at a time. Is there a better way to preprocess the word list so that I can quickly test whether it contains a given word, to improve the runtime of the program?
There are several words such as 'eBook' and 'iPad' that are exceptions to the rules used in the function is_valid_word below (i.e. the word must be all lowercase, start with a capital letter with all other letters lowercase, or be entirely uppercase). Is there a way that I can check whether such a string is a valid word?
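A minimal sketch of the kind of split described in the first issue, using only built-in string methods. The function name split_words_and_punctuation is hypothetical, and the sketch assumes that a word is a run of letters, digits, or apostrophes, and that every other character (spaces included) becomes a token of its own.

```python
def split_words_and_punctuation(text):
    """Split text into words and single-character separators, no imports."""
    tokens = []
    current = ""  # characters collected for the word in progress
    for ch in text:
        if ch.isalnum() or ch == "'":
            # Still inside a word: keep collecting characters.
            current += ch
        else:
            # A separator ends the current word (if any) and is kept as a token.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
    if current:
        # Flush the last word when the text does not end with a separator.
        tokens.append(current)
    return tokens


print(split_words_and_punctuation("are: Physcs, Maths"))
# ['are', ':', ' ', 'Physcs', ',', ' ', 'Maths']
```

Because every character ends up in exactly one token, joining the tokens reproduces the original text, and only the tokens made of word characters need to be passed to the spell check.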
Any help would be greatly appreciated!
def get_words():
    # Load the accepted English words, one per line.
    with open("english.txt") as a:
        words = a.readlines()
    words = [word.strip() for word in words]
    return words


isWord = get_words()
def is_valid_word(st):
    # Reject anything that is not a string or not in the word list.
    if not isinstance(st, str):
        return False
    if st.lower() not in isWord:
        return False
    # Accept all-uppercase, all-lowercase, or Capitalised words.
    return st.isupper() or st.islower() or (st[0].isupper() and st[1:].islower())
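One possible way to handle words such as 'iPad' and 'eBook', sketched under assumptions: keep a small set of exact-case exceptions and check it before applying the general casing rules. The names WORDS, CASE_EXCEPTIONS and is_valid_word_with_exceptions are hypothetical, and the tiny WORDS set only stands in for the contents of english.txt.

```python
# Hypothetical stand-in for the words loaded from english.txt.
WORDS = {"maths", "biology", "notes"}

# Hypothetical set of words whose unusual capitalisation is accepted as-is.
CASE_EXCEPTIONS = {"iPad", "eBook"}


def is_valid_word_with_exceptions(st):
    # Reject non-strings and empty strings up front.
    if not isinstance(st, str) or not st:
        return False
    # Exact-case exceptions bypass both the dictionary and the casing rules.
    if st in CASE_EXCEPTIONS:
        return True
    if st.lower() not in WORDS:
        return False
    # Same casing rules as before: UPPER, lower, or Capitalised.
    return st.isupper() or st.islower() or (st[0].isupper() and st[1:].islower())


print(is_valid_word_with_exceptions("iPad"))   # True
print(is_valid_word_with_exceptions("mAths"))  # False
```

The exceptions are matched case-sensitively, so 'iPad' passes while 'IPad' or 'ipad' would still go through the normal checks.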
def spell_check_file(file):
    incorrectWords = []  # Will contain all incorrectly spelled words.
    with open(file, 'r') as f:
        for line_no, line in enumerate(f):
            for word in line.split():
                if not is_valid_word(word):
                    # Record the line number followed by the offending word.
                    incorrectWords.append(line_no)
                    incorrectWords.append(word)
    return incorrectWords


print(spell_check_file("passage.txt"))
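On the runtime question: a membership test on a list scans it element by element, whereas a set uses hashing, so a lookup takes roughly constant time on average. A minimal sketch of loading the dictionary into a set, assuming one word per line as in english.txt; load_word_set is a hypothetical name and the in-memory lines stand in for the real file.

```python
def load_word_set(lines):
    """Build a set of lowercase words from an iterable of lines."""
    # Blank lines are skipped; surrounding whitespace and newlines are stripped.
    return {line.strip().lower() for line in lines if line.strip()}


# Stand-in for the lines of english.txt; an open file object can be passed the same way.
sample_lines = ["maths\n", "biology\n", "Notes\n", "\n"]
word_set = load_word_set(sample_lines)

print("maths" in word_set)   # True
print("physcs" in word_set)  # False
```

The set is built once up front, and the existing `in` checks then work unchanged against it.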