0

Given a sentence like this, and i have the data structure (dictionary of lists, UNIQUE keys in dictionary):

{'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}

A courageous feline was recently spotted in the neighborhood protecting her mini kitten

How would I efficiently process these set of text to convert the word synonyms of the word cat to the word CAT such that the output is like this:

A fearless cat was recently spotted in the neighborhood protecting her little cat

The algorithm I want is something that can process the initial text to convert the synonyms into its ROOT word (key inside dictionary), the keywords and synonyms would get longer as well. Hence, first, I want to inquire if the data structure I am using is able to perform efficiently and whether there are more efficient structure. For now, I am only able to think of looping through each list inside the dictionary, searching for the synonym's then mapping it back to its keyword

edit: Refined the question

Hi There
  • 1,827
  • 2
  • 6
  • 9
  • What language is this? Python maybe? – trincot Oct 21 '22 at 13:42
  • Add some more clarification to this question, for example what code do you have to try to solve this, where is that code failing? Also, are synonyms unique? For example, is it possible to have entries such as `{'dog': ['pup']}, {'seal': ['pup']}`? If so, how do you know which synonym to use? If not, it seems that all you need is a mapping from synonym to root word. How large is the dictionary? Will it fit into memory or do you need to store it somewhere else? Also, is the algorithm expected to deal with word variants? E.g., if you find `kittens` are you expected to replace it with `cats`? – Welbog Oct 21 '22 at 13:42
  • Just for the simplified use case of replacing a specific word by a specific synonym (without taking into account declensions, plurals etc), you can make a map/dictionary/lookup table from each word to their desired synonym: `{'feline': 'cat', 'kitten'. 'cat'}`, and lookup in constant-time. Words not found at all need not be replaced. – Berthur Oct 21 '22 at 14:01

1 Answers1

1

Your dictionary is organised in the wrong way. It will allow you to quickly find a target word, but that is not helpful when you have an input that does not have the target word, but some synonyms of it.

So organise your dictionary in the opposite sense:

d = {
    'feline': 'cat',
    'kitten': 'cat'
}

To make the replacements, you could create a regular expression and call re.sub with a callback function that will look up the translation of the found word:

import re

regex = re.compile(rf"\b(?:{ '|'.join(map(re.escape, d)) })\b")

s = "A feline was recently spotted in the neighborhood protecting her little kitten"
print(regex.sub(lambda match: d[match[0]], s))

The regular expression makes sure that the match is with a complete word, and not with a substring -- "cafeline" as input will not give a match for "feline".

trincot
  • 317,000
  • 35
  • 244
  • 286
  • The number of synonyms for each keyword will be very long d = { 'feline': 'cat', 'kitten': 'cat' } Hence, key words like 'cat' will be repeated a lot of times here – Hi There Oct 21 '22 at 14:09
  • 1
    OK, and is that a problem? You'll need to anyway mention the synonym, so accompanying it with the target word is in general not an increase of the space *complexity*. There is also the potential avoidance of duplication by [string interning](https://stackoverflow.com/questions/17679861/does-python-intern-strings) – trincot Oct 21 '22 at 14:23
  • @HiThere You can also use a separate array for your chosen synonyms, and map your keys to the index in that array, instead of to the string itself. Like: `'map = {'feline': 0, 'kitten': 0, 'courageous': 1'}` and `synonyms = ['cat', 'brave']`. But consider if you really need it: For single words in a natural language, you're not likely to gain anything by this, just go by what Trincot said. – Berthur Oct 21 '22 at 15:05