1

I have a list of months in Latin:

latinMonths = ['januarii', 'februarii','martii', 'aprilis', 'maii', 'junii', 'julii', 'augusti', 'septembris', 'octobris', 'novembris', 'decembris']

which unfortunately in my text I find variations of them spelled differently ex: 'januarij' or 'septembrjs' etc...

I am trying to scan the text to find the exact word as the the list or variations of it.

I know I can use difflib and found that I can check a sentence with a list of words in this post: Python: how to determine if a list of words exist in a string. Is there a way I can combine both, thus finding an instance in a string where the months in the list or variations of it exist?

ex: If I have the text "primo januarij 1487" I would like to return true as januarij is a close match to january while if I have 'I love tomatoes' neither of the words are a close match or exact match to the words in the list

Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
gannina
  • 173
  • 1
  • 8
  • 1
    I'm not quite clear on what you're trying to do. Can you give an example? – glibdud Jul 27 '18 at 12:13
  • 2
    I think you might want to match all the words within a [levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) of 1 from the words in your list. A naive but working way to do so is to split your text in a list of words and compute for each word the distance to all the months. – hugoShaka Jul 27 '18 at 12:17
  • I was thinking of that but unfortunately my text can be quite long and contain a date...it can be quite inefficient to check all the words in the text...thought there was another way :S – gannina Jul 27 '18 at 12:19
  • It's impossible to generate regex for undefined variations of words. If you want a regex then it's going to be required that you explicitly list all the variations you expect to find. – Tomalak Jul 27 '18 at 12:23
  • thanks maybe I can generate the variations expected! :) – gannina Jul 27 '18 at 12:25
  • And if you can list all the variations you expect to find, you could also solve this without regex. ;) Maybe a good starting point would be something like this: Find all the matches of `\bjan[a-z]*\b`, make that list unique, throw out all false positives, repeat with the same process the other months - and there is your list of variations. – Tomalak Jul 27 '18 at 12:41

1 Answers1

1

A possible solution could be achieved using fuzzywuzzy as follows:

from fuzzywuzzy import fuzz

def fuzzy_months(text:str, months:list, treshold:float = 0.9)->bool:
    """Return if a word within the given text is close enough to any given month."""
    return max([fuzz.ratio(month,word) for month in latinMonths for word in test_string.split()])/100>= treshold

For example taking in consideration the following phrases test_string = 'lorem ipsum siptum abet septembrjs' and fail_string = 'do you want to eat at McDonald?':

fuzzy_months(test_string, latinMonths)
>>> True

fuzzy_months(fail_string, latinMonths)
>>> False