How to split punctuation from words in Python?

Question

I am attempting to create a spell-checking function that reads in a text file containing a passage with several misspelt words. For example: "My favorite subjects are: Physcs, Maths, Chemistree and Biology - I find it necesary to use my iPad to make comprensive notes after class." I have three issues that I am trying to resolve:

1. Currently, the program considers Maths to be an incorrect word because of the comma immediately after it. I believe the best way to solve this is to split the string in the text file like so: ['My', 'favorite', 'subjects', 'are', ':', ' ', 'Physcs', ' ', 'Maths', ','...etc]. How do I split the string into words and punctuation without using any imported Python functions (e.g. string or regex (re) functions)?
2. I am currently comparing each word with a dictionary of accepted English words by iterating over each word in the text file. Is there a better way to preprocess the list so that checking whether it contains a given word is faster, to improve the runtime of the program?
3. There are several words such as 'eBook' and 'iPad' that are exceptions to the rules used in the function is_valid_word below (i.e. the word must start with a capital with all the other letters being lowercase, or all characters in the word must be uppercase). Is there a way that I can check whether such a string is a valid word?

Any help would be greatly appreciated!

def get_words():
    with open( "english.txt" ) as a:
         words = a.readlines()
    words = [word.strip() for word in words]
    return words

isWord = get_words()

def is_valid_word(st):
    # A word is valid if it is in the dictionary and is all uppercase,
    # all lowercase, or capitalised (first letter uppercase, rest lowercase).
    if not isinstance(st, str):
        return False
    if st.lower() not in isWord:
        return False
    return st.isupper() or st.islower() or (st[0].isupper() and st[1:].islower())

def spell_check_file(file):
    incorrectWords = []  # Will contain line numbers and incorrectly spelled words.
    with open(file, 'r') as f:
        for line_no, line in enumerate(f):
            for word in line.split():
                if not is_valid_word(word):
                    incorrectWords.append(line_no)
                    incorrectWords.append(word)
    # Return once, after the whole file has been read.
    return incorrectWords

print(spell_check_file("passage.txt"))

`.isupper()` and `.islower()` are inbuilt functions, anyway. — kaya3, Nov 14 '19 at 22:56
When I say inbuilt functions, I mean any that need to be imported (e.g. string) because the exercise book that I'm working from instructs us to only use split(), strip(), replace() etc. Apologies, I don't think my original question was clear - I shall amend it accordingly. — m.lewis1995, Nov 14 '19 at 22:56
[Third answer here](https://stackoverflow.com/questions/1059559/split-strings-into-words-with-multiple-word-boundary-delimiters) provides a way of separating into words only using split and replace (i.e. no imports). — DarrylG, Nov 14 '19 at 23:04

score 0 · Answer 1 · answered Nov 14 '19 at 22:54

This kind of task is what regexes are for. Trying to do this without regexes is a form of self-punishment.

>>> import re
>>> pattern = re.compile(r"[\w']+|\s+|[^\w'\s]+")
>>> pattern.findall("My favorite subjects are: Physics, Maths, Chemistry")
['My', ' ', 'favorite', ' ', 'subjects', ' ', 'are', ':', ' ', 'Physics', ',',
 ' ', 'Maths', ',', ' ', 'Chemistry']

Note that I've included ' in the part which matches words, so words like "don't" will stay in one piece.
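As a rough sketch of how those tokens could feed the spell check from the question: keep only the word tokens and test them against a set instead of a list, since set membership is constant time on average. The names `misspelt_tokens` and `accepted` are hypothetical, and the tiny `accepted` set stands in for the contents of english.txt.

```python
import re

# Same pattern as above: words (with apostrophes), whitespace, or punctuation.
pattern = re.compile(r"[\w']+|\s+|[^\w'\s]+")
word_pattern = re.compile(r"[\w']+")

def misspelt_tokens(text, accepted):
    """Return the word tokens whose lowercase form is not in `accepted`."""
    tokens = pattern.findall(text)
    # Drop whitespace and punctuation tokens; keep only whole-word tokens.
    words = [t for t in tokens if word_pattern.fullmatch(t)]
    return [w for w in words if w.lower() not in accepted]

# Hypothetical stand-in for the dictionary file, stored as a set.
accepted = {"my", "favorite", "subjects", "are", "maths"}
print(misspelt_tokens("My favorite subjects are: Physcs, Maths", accepted))
```

Because the comma is its own token, 'Maths' is looked up without the trailing punctuation.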

kaya3


But is there a way to do this without regexes? – m.lewis1995 Nov 14 '19 at 23:00
In theory, sure. You could write your own code which basically does what this regex does; scan consecutive characters, and yield a substring each time the next consecutive character would be in a different character class. (The three character classes are word-characters and apostrophe, whitespace characters, and everything else.) I don't think there is even much educational value in avoiding regexes here, though, because if you want to learn how to do string manipulation then you really ought to learn how to use regexes. – kaya3 Nov 14 '19 at 23:06
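
The scan described in the comment above can be sketched without any imports, using only built-in string methods. This is one possible reading of it, not kaya3's code: `char_class` and `tokenize` are hypothetical names, and `isalnum()` plus underscore is only an approximation of the regex class `\w`.

```python
def char_class(ch):
    """Classify a character as 'word', 'space', or 'punct'."""
    # Letters, digits, underscore and apostrophe count as word characters.
    if ch.isalnum() or ch == "_" or ch == "'":
        return "word"
    if ch.isspace():
        return "space"
    return "punct"

def tokenize(text):
    """Split text into runs of consecutive characters of the same class."""
    tokens = []
    start = 0
    for i in range(1, len(text) + 1):
        # Close the current run at the end of the text or when the class changes.
        if i == len(text) or char_class(text[i]) != char_class(text[start]):
            tokens.append(text[start:i])
            start = i
    return tokens

print(tokenize("My favorite subjects are: Physics, Maths, Chemistry"))
```

On this sentence it should give the same tokens as the regex version, and "don't" stays in one piece because the apostrophe is treated as a word character.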
