Python - How to intuit word from abbreviated text using NLP?

Question

I was recently working on a data set that used abbreviations for various words. For example,

wtrbtl = water bottle
bwlingbl = bowling ball
bsktball = basketball

There did not seem to be any consistency in the convention used, i.e. sometimes they used vowels, sometimes not. I am trying to build a mapping object like the one above for abbreviations and their corresponding words without a complete corpus or comprehensive list of terms (i.e. abbreviations could be introduced that are not explicitly known). For simplicity's sake, say it is restricted to stuff you would find in a gym, but it could be anything.

Basically, if you only look at the left-hand side of the examples, what kind of model could do the same processing as our brain in terms of relating each abbreviation to the corresponding full-text label?

My ideas have stopped at taking the first and last letter and finding those in a dictionary, then assigning a priori probabilities based on context. But since there are a large number of morphemes without a marker that indicates the end of a word, I don't see how it's possible to split them.

UPDATED:

I also had the idea to combine a couple of string metric algorithms, like a Match Rating Algorithm, to determine a set of related terms and then calculate the Levenshtein distance between each word in the set and the target abbreviation. However, I am still in the dark when it comes to abbreviations for words not in a master dictionary. Basically, this means inferring word construction. Maybe a Naive Bayes model could help, but I am concerned that any error in precision caused by using the algorithms above will invalidate any model training process.
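To make the Levenshtein step concrete, here is a minimal sketch using only the standard library. The function names and the small candidate list are hypothetical; in practice the candidates would come from whatever pre-filter (such as a match rating step) produces the set of related terms.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def rank_candidates(abbr, candidates):
    """Order candidate words by edit distance to the abbreviation (closest first)."""
    return sorted(candidates, key=lambda word: levenshtein(abbr, word))


# Hypothetical candidate set for illustration only.
print(rank_candidates("bsktball", ["baseball", "basketball", "softball"]))
```

Plain edit distance penalises long expansions, so heavily abbreviated terms such as wtrbtl may rank poorly without some length normalisation.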

Any help is appreciated, as I am really stuck on this one.

A fuzzy search algorithm might help solve this problem. Relevant discussions: [StackOverflow](https://stackoverflow.com/questions/16907825/how-to-implement-sublime-text-like-fuzzy-search), [Reddit](https://amp-reddit-com.cdn.ampproject.org/v/s/amp.reddit.com/r/programming/comments/4cfz8r/reverse_engineering_sublime_texts_fuzzy_match/) — Udayraj Deshmukh, Feb 24 '19 at 21:27

score 33 · Accepted Answer · edited Apr 23 '20 at 12:01

If you cannot find an exhaustive dictionary, you could build (or download) a probabilistic language model, to generate and evaluate sentence candidates for you. It could be a character n-gram model or a neural network.

For your abbreviations, you can build a "noise model" which predicts probability of character omissions. It can learn from a corpus (you have to label it manually or half-manually) that consonants are missed less frequently than vowels.
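As a rough illustration of such a noise model, the sketch below estimates per-character omission rates from a few labelled (abbreviation, full text) pairs. It assumes each abbreviation is formed purely by dropping characters, so it is a subsequence of the full text with spaces removed; the function name and the training pairs are hypothetical.

```python
from collections import Counter


def omission_rates(pairs):
    """Estimate P(character is dropped) from (abbreviation, full_text) pairs.

    Assumes the abbreviation is a subsequence of the full text (spaces ignored)
    and aligns the two greedily from left to right.
    """
    omitted, total = Counter(), Counter()
    for abbr, full in pairs:
        i = 0  # position of the next unmatched abbreviation character
        for ch in full.replace(" ", ""):
            total[ch] += 1
            if i < len(abbr) and abbr[i] == ch:
                i += 1            # character survived in the abbreviation
            else:
                omitted[ch] += 1  # character was dropped
        if i != len(abbr):
            raise ValueError(f"{abbr!r} is not a subsequence of {full!r}")
    return {ch: omitted[ch] / total[ch] for ch in total}


# Hypothetical hand-labelled examples taken from the question.
rates = omission_rates([("wtrbtl", "water bottle"), ("bsktball", "basketball")])
for ch in sorted(rates, key=rates.get, reverse=True):
    print(ch, round(rates[ch], 2))
```

With more labelled pairs, the rates could be pooled into classes (vowels, consonants, spaces) to avoid sparse per-letter estimates.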

Having a complex language model and a simple noise model, you can combine them using the noisy channel approach (see e.g. the article by Jurafsky for more details) to suggest candidate sentences.
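A minimal sketch of the noisy channel combination follows. It scores a candidate as -log P(candidate) - log P(abbreviation | candidate), where the second term assumes each character is dropped independently. The omission probabilities and the word priors are made-up illustrative numbers, not trained values, and the prior table stands in for a real language model.

```python
import math

VOWELS = set("aeiou")


def omit_prob(ch):
    """Assumed (not learned) probability that a character is dropped."""
    if ch == " ":
        return 0.95
    return 0.6 if ch in VOWELS else 0.1


def channel_prob(abbr, full):
    """P(abbr | full) when each character of `full` is dropped independently."""
    # row[j] = probability that the characters read so far produced abbr[:j]
    row = [1.0] + [0.0] * len(abbr)
    for ch in full:
        p = omit_prob(ch)
        new = [row[0] * p]
        for j in range(1, len(abbr) + 1):
            kept = row[j - 1] * (1 - p) if abbr[j - 1] == ch else 0.0
            new.append(row[j] * p + kept)
        row = new
    return row[-1]


def noisy_channel_score(abbr, full, prior):
    """Negative log of prior * likelihood; lower is better."""
    likelihood = channel_prob(abbr, full)
    if likelihood == 0.0 or prior == 0.0:
        return math.inf
    return -math.log(prior) - math.log(likelihood)


# Hypothetical priors standing in for a language model.
priors = {"basketball": 0.002, "baseball": 0.005, "basket": 0.003}
scores = {w: noisy_channel_score("bsktball", w, p) for w, p in priors.items()}
print(min(scores, key=scores.get), scores)
```

Candidates that cannot produce the abbreviation by deletions alone get an infinite score, so a more frequent but incompatible word such as "baseball" is ruled out.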

Update. I got enthusiastic about this problem and implemented this algorithm:

language model (character 5-gram trained on the Lord of the Rings text)
noise model (probability of each symbol being abbreviated)
beam search algorithm, for candidate phrase suggestion.
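The three pieces listed above can be sketched in a few dozen lines. This is not the notebook's code: it swaps the 5-gram model for a character bigram model trained on a tiny hypothetical string, and uses fixed keep/omit costs instead of a learned noise model. All names and parameter values are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_char_bigram(text):
    """Return (cost, alphabet); cost(prev, cur) = -log P(cur | prev), add-one smoothed."""
    alphabet = sorted(set(text))
    counts = defaultdict(Counter)
    for prev, cur in zip("^" + text, text):  # "^" marks the start of the text
        counts[prev][cur] += 1

    def cost(prev, cur):
        total = sum(counts[prev].values())
        return -math.log((counts[prev][cur] + 1) / (total + len(alphabet)))

    return cost, alphabet


def beam_search(abbr, cost, alphabet, keep_cost=0.1, omit_cost=2.0,
                width=50, max_omitted=6, top=5):
    """Grow candidate expansions one character at a time, keeping the best `width`."""
    beam = {("", 0): 0.0}  # (text so far, abbreviation chars consumed) -> score
    finished = {}
    for _ in range(len(abbr) + max_omitted):
        expanded = {}
        for (text, used), score in beam.items():
            prev = text[-1] if text else "^"
            for ch in alphabet:
                base = score + cost(prev, ch)
                options = [(used, base + omit_cost)]  # ch was dropped by the writer
                if used < len(abbr) and ch == abbr[used]:
                    options.append((used + 1, base + keep_cost))  # ch was kept
                for new_used, new_score in options:
                    key = (text + ch, new_used)
                    if new_score < expanded.get(key, math.inf):
                        expanded[key] = new_score
        beam = dict(sorted(expanded.items(), key=lambda kv: kv[1])[:width])
        for (text, used), score in beam.items():
            if used == len(abbr) and score < finished.get(text, math.inf):
                finished[text] = score
    return dict(sorted(finished.items(), key=lambda kv: kv[1])[:top])


# Tiny hypothetical training text; a real model needs a far larger corpus.
cost, alphabet = train_char_bigram("water bottle basketball bowling ball ")
print(beam_search("wtr", cost, alphabet))
```

As in the answer, lower scores are better, and the quality of the suggestions depends heavily on the training text and the beam width.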

My solution is implemented in this Python notebook. With trained models, it has an interface like noisy_channel('bsktball', language_model, error_model), which, by the way, returns {'basket ball': 33.5, 'basket bally': 36.0}. The dictionary values are scores of the suggestions (the lower, the better).

With other examples it works worse: for 'wtrbtl' it returns

{'water but all': 23.7, 
 'water but ill': 24.5,
 'water but lay': 24.8,
 'water but let': 26.0,
 'water but lie': 25.9,
 'water but look': 26.6}

For 'bwlingbl' it gives

{'bwling belia': 32.3,
 'bwling bell': 33.6,
 'bwling below': 32.1,
 'bwling belt': 32.5,
 'bwling black': 31.4,
 'bwling bling': 32.9,
 'bwling blow': 32.7,
 'bwling blue': 30.7}

However, when training on an appropriate corpus (e.g. sports magazines and blogs; maybe with oversampling of nouns), and maybe with more generous width of beam search, this model will provide more relevant suggestions.

score 16 · Answer 2 · answered Apr 20 '17 at 07:45

So I've looked at a similar problem, and came across a fantastic package called PyEnchant. If you use the built-in spell-checker you can get word suggestions, which would be a nice and simple solution. However, it will only suggest single words (as far as I can tell), and so the situation you have:

wtrbtl = water bottle

Will not work.

Here is some code:

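import enchant

# Load PyEnchant's US English dictionary
wordDict = enchant.Dict("en_US")

inputWords = ['wtrbtl', 'bwlingbl', 'bsktball']
for word in inputWords:
    # suggest() returns a list of spelling suggestions for the input
    print(wordDict.suggest(word))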

The output is:

['rebuttal', 'tribute']
['bowling', 'blinding', 'blinking', 'bumbling', 'alienable', 'Nibelung']
['basketball', 'fastball', 'spitball', 'softball', 'executable', 'basketry']

Perhaps if you know what sort of abbreviations there are you can separate the string into two words, e.g.

'wtrbtl' -> ['wtr', 'btl']
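One way to explore such splits without knowing the cut point in advance is to try every position and keep the splits where both halves look like abbreviations of known words. The sketch below uses a small hypothetical vocabulary and a simple rule (same first letter, letters appear in order) instead of the PyEnchant suggestions; both are assumptions for illustration.

```python
def is_abbreviation(piece, word):
    """True if `piece` starts like `word` and its letters appear in `word` in order."""
    if not piece or not word or piece[0] != word[0]:
        return False
    letters = iter(word)
    return all(ch in letters for ch in piece)  # subsequence test


def split_and_expand(abbr, vocabulary):
    """Try every two-way split and return (first_word, second_word) matches."""
    matches = []
    for cut in range(1, len(abbr)):
        left, right = abbr[:cut], abbr[cut:]
        for first in vocabulary:
            if not is_abbreviation(left, first):
                continue
            for second in vocabulary:
                if is_abbreviation(right, second):
                    matches.append((first, second))
    return matches


# Hypothetical gym vocabulary.
vocab = ["water", "bottle", "bowling", "ball", "basket", "weight", "bar", "belt"]
print(split_and_expand("wtrbtl", vocab))
print(split_and_expand("bwlingbl", vocab))
```

Short pieces such as 'bl' match several words, so the resulting pairs still need to be ranked, for example by how common each word or word pair is.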

There's also the Natural Language Toolkit (NLTK), which is AMAZING, and you could use this in combination with the above code by looking at how common each suggested word is, for example.
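The frequency idea can be sketched with the standard library alone. The token list here is a made-up stand-in for a real corpus; with NLTK installed, nltk.FreqDist could play the role of the Counter.

```python
from collections import Counter


def rank_by_frequency(suggestions, corpus_tokens):
    """Order spell-checker suggestions by how often they occur in a corpus."""
    freq = Counter(token.lower() for token in corpus_tokens)
    # sorted() is stable, so ties keep the spell-checker's original order.
    return sorted(suggestions, key=lambda word: freq[word.lower()], reverse=True)


# Hypothetical corpus for illustration only.
corpus = "the basketball rolled past the basketball hoop near the fastball cage".split()
print(rank_by_frequency(["fastball", "basketry", "basketball"], corpus))
```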

Good luck!

Can it suggest only the most probable word instead of a list? – Mohith7548 Feb 11 '21 at 04:40

score 10 · Answer 3 · answered Jan 08 '18 at 00:58

One option is to go back in time and compute the Soundex Algorithm equivalent.

Soundex drops all the vowels, and handles common mispronunciations and crunched-up spellings. The algorithm is simplistic and used to be done by hand. The downside is that it has no special word stemming or stop word support.
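For reference, a compact sketch of the common American Soundex rules is given below (first letter kept, consonants mapped to digits, adjacent duplicates collapsed, h and w transparent, result padded to four characters). Treating spaces as ignorable so that multi-word labels get a single code is an assumption made here, not part of the original algorithm.

```python
# Digit classes used by American Soundex.
CODES = {ch: digit
         for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                "4": "l", "5": "mn", "6": "r"}.items()
         for ch in letters}


def soundex(text):
    """Return the 4-character Soundex code; non-letters (including spaces) are ignored."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate equal codes; vowels do
            prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for abbr, full in [("wtrbtl", "water bottle"), ("bwlingbl", "bowling ball"),
                   ("bsktball", "basketball")]:
    print(abbr, soundex(abbr), full, soundex(full))
```

Because the code is truncated to four characters, many different words share a code, so a match is only a weak hint.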

Charles Merriam


This is an interesting idea but I don't believe it fits the circumstances of my problem as it requires that you start with the complete word and then encode the word using the algorithm. My problem is that I am starting with the abbreviation and I am trying to get the full word from it. – Dan Temkin Jan 08 '18 at 04:16
True. You could either store the dictionary of hashes or find a restricted Levenshtein distance allowing only the adding of vowels and doubling of characters. It's either a space or speed choice. – Charles Merriam Jan 09 '18 at 06:42
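A possible reading of the restricted edit idea in the last comment is a yes/no check: can the abbreviation be turned into the word using only vowel insertions and character doublings? The sketch below implements that reading with memoised recursion; dropping spaces from the word first is an extra assumption, and the function name is hypothetical.

```python
from functools import lru_cache

VOWELS = set("aeiou")


def can_expand(abbr, word):
    """True if `word` can be built from `abbr` by inserting vowels or doubling letters."""
    word = word.replace(" ", "")  # assumption: spaces carry no information

    @lru_cache(maxsize=None)
    def ok(i, j):
        # Can abbr[i:] be expanded into word[j:]?
        if j == len(word):
            return i == len(abbr)
        ch = word[j]
        if i < len(abbr) and abbr[i] == ch and ok(i + 1, j + 1):
            return True  # ch is the next abbreviation character
        inserted_vowel = ch in VOWELS
        doubled = j > 0 and word[j - 1] == ch
        return (inserted_vowel or doubled) and ok(i, j + 1)

    return ok(0, 0)


for abbr, word in [("wtrbtl", "water bottle"), ("sndwch", "sandwich"),
                   ("wtr", "winter")]:
    print(abbr, word, can_expand(abbr, word))
```

This trades the stored table of hashes for a per-word check, which is the space versus speed choice mentioned in the comment.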

score 4 · Answer 4 · answered Jan 09 '18 at 06:24

... abbreviations for words not in a master dictionary.

So, you're looking for an NLP model that can come up with valid English words, without having seen them before?

It is probably easier to find a more exhaustive word dictionary, or perhaps to map each word in the existing dictionary to common extensions such as +"es" or word[:-1] + "ies".
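A small sketch of the extension idea, assuming a plain list of base words. The rules are deliberately naive and will overgenerate (for example "boxs"), so the output is better treated as a candidate set than as a list of valid words.

```python
def common_extensions(word):
    """Return the word plus a few naive suffix variants."""
    forms = {word, word + "s", word + "es"}
    if word.endswith("y"):
        forms.add(word[:-1] + "ies")
    return forms


def extend_dictionary(words):
    """Map every generated form back to the base word it came from."""
    return {form: base for base in words for form in common_extensions(base)}


# Hypothetical base words.
extended = extend_dictionary(["box", "berry"])
print(sorted(extended))
```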

Melvin


My thought process was how to handle unexpected values. I mention above that the problem, for convenience, could be restricted to objects in a gym. Now if someone brings their lunch then "sndwch" might show up in the data, but it is unlikely that sandwich will be in a dictionary of gym items. So I am trying to find a solution that doesn't fail under uncertainty. The word doesn't have to be 'valid English', just a probable approximation of the target word. – Dan Temkin Jan 09 '18 at 23:29
One issue with this would be the threshold for uncertainty, as you mentioned. Most ML models will be dealing with probabilities, i.e. how probable it is that "sndwch" actually refers to a particular word (or set of words) in the gym dictionary. Setting the threshold appropriately can allow the model to look for an entry in an external dictionary, as opposed to the domain-specific one (gym vs full_dict). – Melvin Jan 10 '18 at 05:02
I would expect the threshold to be based on the disparity in probability between say the highest-probable word in the domain-specific dictionary (sndwch=sand bags?) vs that in the full dictionary (sandwich). The problem is, it is going to be very difficult to set a proper threshold, and the optimal level will probably change based on domain-vs-external dictionary size ratio, or even on the level of abbreviation for each domain dataset. – Melvin Jan 10 '18 at 05:03
In other words, it's probably not impossible, but I strongly suspect an easier solution might exist, perhaps requiring a change in the way the problem is approached. Also, just guessing, but I'm sensing that there may be an XY problem here. – Melvin Jan 10 '18 at 05:05
Not sure what you mean by an XY problem. But, otherwise I completely agree. My first instinct was to implement a "divide and conquer" paradigm when doing dictionary comparisons where I would first assign each word a category or domain and then query the domain specific dict. The problem I ran into was that the errors were compounded when the initial categorization was false. But thanks for your comments. Very insightful. :) – Dan Temkin Jan 10 '18 at 20:05
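The threshold idea from the comments above could look roughly like this: prefer the domain dictionary unless the full dictionary offers a clearly better match. The sketch uses difflib.SequenceMatcher as a stand-in similarity score, and the margin value and both word lists are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(abbr, words):
    """Return (similarity, word) for the most similar word in the list."""
    return max((SequenceMatcher(None, abbr, w).ratio(), w) for w in words)


def resolve(abbr, domain_words, full_words, margin=0.1):
    """Use the domain match unless the full dictionary beats it by more than `margin`."""
    domain_score, domain_word = best_match(abbr, domain_words)
    full_score, full_word = best_match(abbr, full_words)
    return full_word if full_score - domain_score > margin else domain_word


# Hypothetical dictionaries.
gym = ["sand bags", "dumbbell"]
full_dict = ["sandwich", "sand bags", "dumbbell"]
print(resolve("sndwch", gym, full_dict))
print(resolve("dmbbl", gym, full_dict))
```

As noted in the comments, a suitable margin likely depends on the relative dictionary sizes and on how aggressively the data is abbreviated, so it would need tuning per dataset.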
