find words in a text

Question

I have a problem with searching for words in a text.

In my code I look for words in an Italian text (split into strings, one per paragraph). For short words like "e", "in", or "ad", it reports many matches, but these are really hits inside longer words such as "begin" or "adduce", or any word that contains the letter "e". Is there an efficient way to avoid this "mistake"? I have searched everywhere but can't find anything. I think it's a simple problem, but I'm not an expert at all. I would like to do it without importing any libraries. Thanks to those who will help me.

sample text: ['sostanza di cieli ed astri cercai per oceani. di donarmi il diluvio ti dissi io, o musa, scorgendo il destino.', " o zeus che infiniti addurre volle, principiando con stormi arditi fulmini di ira molto funesta laddove si alzasse eccessivamente il volare negato all'uomo.", 'imperterrita irrefrenabile poiché poiché memore di ciò, da qualunque principio, memore di di di ciò di ciò, da qualunque principio, ad ogni costo, dea figlia di zeus, narrane cagione e spirito. ']

I have to find these words (not all of them may be in the text; for example, 'e' is missing): uomo, dissi io, o musa, molto, eccessivamente, e, in, di ciò

expected output: uomo, dissi io, o musa, molto, eccessivamente, di ciò

Can you provide a sample of your text and the expected output? — IoaTzimas, Aug 04 '21 at 15:46
search for `" e "`, `" in "`, `" ad "` (with spaces), this way it will only show them if they are single words — Einliterflasche, Aug 04 '21 at 15:48
@Einliterflasche yes, that is a good idea, but how can I do this? — orsettomorbido, Aug 04 '21 at 15:55
@orsettomorbido you could write a function that returns the inputted string but with spaces (`"ad"` => `" ad "`), but this is not as elegant as the regex solution by H. Rittich — Einliterflasche, Aug 04 '21 at 16:02
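
The space-padding idea from the comments can be sketched without any imports. This is only a rough illustration, assuming that punctuation and apostrophes should count as word separators; the names `PUNCTUATION`, `normalize` and `find_padded` are made up for this example and do not come from the thread.

```python
# Characters treated as word separators (an assumption; extend as needed).
PUNCTUATION = ".,;:!?\"'()"

def normalize(text):
    # Replace punctuation with spaces, lowercase, collapse runs of whitespace,
    # and pad both ends with a single space so every word is space-delimited.
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    return " " + " ".join(text.lower().split()) + " "

def find_padded(paragraphs, targets):
    # Return the targets that occur as whole words (or whole phrases)
    # in at least one paragraph, in the order they were given.
    padded = [normalize(p) for p in paragraphs]
    found = []
    for target in targets:
        needle = normalize(target)  # "ad" -> " ad "
        if any(needle in p for p in padded):
            found.append(target)
    return found

paragraph = (" o zeus che infiniti addurre volle, principiando con stormi "
             "arditi fulmini di ira molto funesta laddove si alzasse "
             "eccessivamente il volare negato all'uomo.")
print(find_padded([paragraph], ["uomo", "molto", "e", "in", "ad"]))
# prints ['uomo', 'molto']
```

Because the needle is padded too, "ad" no longer matches inside "addurre" or "laddove", and multi-word targets such as "di ciò" work the same way as single words.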

ti7 · Answer 1 · 2021-08-04T16:16:34.503

You likely want something more advanced that understands the grammar of the language you're trying to parse, but this may work for you:

- split each paragraph up into individual words
- check each word for closeness to your word (i.e. Levenshtein distance or another metric)

Perhaps:

import difflib

def iter_test_words(source_paragraph, words_to_check):
    # For each word, yield a list holding its best match (empty if none is close enough)
    for word_test in source_paragraph.split():  # split on whitespace
        yield difflib.get_close_matches(word_test, words_to_check, n=1, cutoff=0.9)

Some further help:

- you could try/except on the first index [0] of the returned list to find anomalous words (IndexError)
- you will likely need to tune your cutoff as needed (or even dynamically, i.e. retry for anomalies) to get good results
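
A minimal sketch of the try/except idea, assuming you want the matched and the anomalous words collected separately. The function name `split_matches` and its return shape are illustrative choices, not part of `difflib`.

```python
import difflib

def split_matches(source_paragraph, words_to_check, cutoff=0.9):
    # Separate the words of a paragraph into those with a close match
    # in words_to_check and those without one ("anomalous" words).
    matched, anomalous = [], []
    for word in source_paragraph.split():
        close = difflib.get_close_matches(word, words_to_check, n=1, cutoff=cutoff)
        try:
            matched.append((word, close[0]))  # best match, if there is one
        except IndexError:
            anomalous.append(word)  # empty list: nothing reached the cutoff
    return matched, anomalous

matched, anomalous = split_matches("ira molto funesta", ["molto", "uomo"])
print(matched)    # prints [('molto', 'molto')]
print(anomalous)  # prints ['ira', 'funesta']
```

Retrying the anomalous words with a lower `cutoff` would be one way to tune the threshold dynamically.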

Again, using and configuring a library for your needs will probably give better results, ideally something which:

- understands the grammar
- understands subtle (for computers) word variations (e.g. in your case, are the Italian tenses of "to go", andando and andato, the same? And ondato "wave" is another concept despite being a better textual match)

>>> import difflib
>>> difflib.get_close_matches("andato", ["andando", "ondato"])
['ondato', 'andando']
>>> difflib.SequenceMatcher(None, "andato", "andando").ratio()
0.7692307692307693
>>> difflib.SequenceMatcher(None, "andato", "ondato").ratio()
0.8333333333333334

score 1 · Answer 2 · answered Aug 04 '21 at 15:51

You can use a regular expression for this purpose. The special sequence \b matches a word boundary. For example, the pattern \bin\b matches the beginning of a word, followed by "in", followed by the end of a word, so it finds "in" only when it stands alone.

Here is the code:

>>> import re
>>> len(re.findall(r'\bin\b', 'begin in begin end'))
1
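
The same pattern can be built for each word in a list. The following is a sketch only, assuming the targets should be matched literally (hence `re.escape`); the helper name `find_with_regex` is made up for illustration.

```python
import re

def find_with_regex(paragraphs, targets):
    # Return the targets that occur as whole words (or whole phrases)
    # in at least one paragraph, in the order they were given.
    found = []
    for target in targets:
        # \b on both sides keeps "ad" from matching inside "addurre"
        pattern = re.compile(r'\b' + re.escape(target) + r'\b')
        if any(pattern.search(p) for p in paragraphs):
            found.append(target)
    return found

paragraph = (" o zeus che infiniti addurre volle, principiando con stormi "
             "arditi fulmini di ira molto funesta laddove si alzasse "
             "eccessivamente il volare negato all'uomo.")
print(find_with_regex([paragraph], ["uomo", "molto", "e", "in", "ad"]))
# prints ['uomo', 'molto']
```

Accented letters such as "ò" count as word characters for Python 3 string patterns, so a phrase like "di ciò" followed by a comma is matched as well.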
