
I have a problem with searching for whole words in a text.

In my code I look for words within an Italian text (split into strings, one per paragraph). When I search for short words like "e", "in" or "ad", it reports many matches, but in reality these are matches inside longer words such as "begin" or "adduce", or any word that contains the letter "e". Is there an efficient way to avoid these false matches? I have searched everywhere but can't find anything. I think it's a simple problem, but I'm not an expert at all. I would like to do it without importing any libraries. Thanks to anyone who can help.

Sample text: ['sostanza di cieli ed astri cercai per oceani. di donarmi il diluvio ti dissi io, o musa, scorgendo il destino.', " o zeus che infiniti addurre volle, principiando con stormi arditi fulmini di ira molto funesta laddove si alzasse eccessivamente il volare negato all'uomo.", 'imperterrita irrefrenabile poiché poiché memore di ciò, da qualunque principio, memore di di di ciò di ciò, da qualunque principio, ad ogni costo, dea figlia di zeus, narrane cagione e spirito. ']

I need to find these words and phrases (not all of them are necessarily in the text; for example, 'e' is missing): uomo, dissi io, o musa, molto, eccessivamente, e, in, di ciò

Expected output: uomo, dissi io, o musa, molto, eccessivamente, di ciò

2 Answers


You likely want something more advanced that understands the grammar of the language you're trying to parse, but perhaps something like this will work for you:

import difflib

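def iter_test_words(source_paragraph, words_to_check):
    """Yield, for each token, a list with its closest match (empty if none)."""
    for word_test in source_paragraph.split():  # split on whitespace
        # n=1 keeps only the best candidate; cutoff=0.9 demands a near-exact match
        yield difflib.get_close_matches(word_test, words_to_check, n=1, cutoff=0.9)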

Some further pointers:

  • each yielded value is a list; you could take its first element ([0]) inside a try/except, and an IndexError then tells you the word had no close match (an anomalous word)
  • you will likely need to tune the cutoff (possibly dynamically, i.e. retrying the anomalies with a lower cutoff) to get good results
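The try/except idea from the list above could be sketched roughly as follows. The function name split_matches and its return shape are made up for illustration and are not part of difflib; only difflib.get_close_matches is a real library call.

```python
import difflib

def split_matches(paragraph, words_to_check, cutoff=0.9):
    """Hypothetical helper: separate tokens with a close match from those without."""
    matched, anomalies = [], []
    for token in paragraph.split():  # split on whitespace
        candidates = difflib.get_close_matches(
            token, words_to_check, n=1, cutoff=cutoff
        )
        try:
            # candidates is a list with at most one element (n=1)
            matched.append((token, candidates[0]))
        except IndexError:
            # empty list: nothing reached the cutoff for this token
            anomalies.append(token)
    return matched, anomalies

matched, anomalies = split_matches("begin in end", ["in", "e"])
print(matched)    # tokens paired with their closest target word
print(anomalies)  # tokens that had no close match
```

Note that tokens keep any attached punctuation after split(), so a token such as "musa," scores just under 0.9 against "musa". Stripping punctuation first, or lowering the cutoff, may be necessary.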

Again, using and configuring a library for your needs will probably give better results, ideally something which

  • understands the grammar
  • understands word variations that are subtle for a computer. For your case: should the Italian forms andando and andato (both from "to go") count as the same word? And note that ondato ("wave") is a different concept even though it is the closer textual match:
    >>> import difflib
    >>> difflib.get_close_matches("andato", ["andando", "ondato"])
    ['ondato', 'andando']
    >>> difflib.SequenceMatcher(None, "andato", "andando").ratio()
    0.7692307692307693
    >>> difflib.SequenceMatcher(None, "andato", "ondato").ratio()
    0.8333333333333334
    
ti7

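You can use a regular expression for this purpose. The special sequence \b matches a word boundary, i.e. the position where a word starts or ends. For example, the pattern \bin\b matches a word boundary, followed by "in", followed by another word boundary, so it finds "in" only as a whole word and not inside "begin". The re module is part of Python's standard library, so nothing has to be installed.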

Here is the code:

>>> import re
>>> len(re.findall(r'\bin\b', 'begin in begin end'))
1
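
Applied to a list of target words and phrases like the one in the question, the same idea might look like the sketch below. The helper names count_whole_word and found_phrases are invented for this example, and the third paragraph is shortened from the question's sample text. re.escape is used so that any special characters in a phrase are treated literally.

```python
import re

# First two paragraphs as in the question; the third is shortened for brevity.
paragraphs = [
    'sostanza di cieli ed astri cercai per oceani. di donarmi il diluvio '
    'ti dissi io, o musa, scorgendo il destino.',
    " o zeus che infiniti addurre volle, principiando con stormi arditi "
    "fulmini di ira molto funesta laddove si alzasse eccessivamente il "
    "volare negato all'uomo.",
    'memore di ciò, da qualunque principio, ad ogni costo',
]
targets = ['uomo', 'dissi io', 'o musa', 'molto',
           'eccessivamente', 'e', 'in', 'di ciò']

def count_whole_word(phrase, text):
    """Count occurrences of phrase in text that start and end on word boundaries."""
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return len(re.findall(pattern, text))

def found_phrases(phrases, texts):
    """Return the phrases that occur as whole words in at least one text."""
    return [p for p in phrases
            if any(count_whole_word(p, t) > 0 for t in texts)]

print(', '.join(found_phrases(targets, paragraphs)))
# uomo, dissi io, o musa, molto, eccessivamente, di ciò
```

Here "e" and "in" are not reported, because they only occur inside longer words such as "ed" or "infiniti". On the question's full third paragraph, "e" does occur as a standalone word (in "cagione e spirito"), so it would be reported there too.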
H. Rittich