I am looking to analyse a large number (around 30,000) of small documents and determine whether they mention a certain subject, such as the term "safety". It's easy enough to do a string.find() or tokenize the raw text and compare lists, but I would like the search terms to be dynamic: if the user types in "safety", my program should identify all forms of the word, so the search would also match related words like "safe", "safely", "safer", and "safest" in the raw text. My hope is that the user can put in any term and have a reasonable expectation that it will find related terms in the source documents.
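For context, this is roughly the kind of naive matching I have now (the contains_term name is just for illustration):

```python
# Roughly what I'm doing now: tokenize each document and check for an
# exact match on the search term. Requires NLTK's "punkt" models
# (nltk.download("punkt")).
import nltk

def contains_term(raw_text, term):
    tokens = {w.lower() for w in nltk.word_tokenize(raw_text)}
    return term.lower() in tokens
```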
I have looked at stemming and lemmatizing, but stemming produces some odd results (e.g. "safety" stems to "safeti" while "safely" stems to "safe"), and lemmatizing more often than not just returns the search term unchanged. I've tried the two suggestions shown here with the same results:
How to list all the forms of a word using NLTK in python
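Here is a quick repro of what I'm seeing, using NLTK's Porter stemmer and WordNet lemmatizer:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # needs nltk.download("wordnet")

print(stemmer.stem("safety"))          # "safeti"
print(stemmer.stem("safely"))          # "safe"
print(lemmatizer.lemmatize("safety"))  # "safety" -- unchanged
```

Because "safety" and "safely" reduce to different stems, stemming both the query and the document tokens still doesn't connect them.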
Any help would be appreciated. If all else fails, I'll just build a list of terms at runtime from the user's input.
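If I end up going that route, I imagine something like the following, using WordNet's derivationally related forms to expand the user's term into a set of search words (the expand_term helper is just my sketch, and it still misses inflections like "safer"/"safest", which I'd have to handle separately):

```python
# Expand a user-supplied term into a set of related search words via
# WordNet's derivationally related forms. Requires nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def expand_term(term):
    forms = {term.lower()}
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            forms.add(lemma.name().lower())
            for related in lemma.derivationally_related_forms():
                forms.add(related.name().lower())
    return forms

print(expand_term("safety"))  # includes "safe" among the results
```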