I was recently working on a data set that used abbreviations for various words. For example,
wtrbtl = water bottle
bwlingbl = bowling ball
bsktball = basketball
There did not seem to be any consistent convention: sometimes vowels were kept, sometimes not. I am trying to build a mapping object like the one above for abbreviations and their corresponding words, without a complete corpus or comprehensive list of terms (i.e. abbreviations could be introduced that are not explicitly known). For simplicity's sake, say it is restricted to things you would find in a gym, but it could be anything.
Basically, if you only look at the left-hand side of the examples, what kind of model could do the same processing as our brain does when relating each abbreviation to the corresponding full-text label?
My ideas have stopped at taking the first and last letters of an abbreviation and finding dictionary words that match them, then assigning prior probabilities based on context. But since an abbreviation can contain a large number of morphemes with no marker indicating the end of a word, I don't see how it's possible to split them.
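To make the first-and-last-letter idea concrete, here is a minimal Python sketch. The vocabulary, the prior values, and the function names `candidates` and `rank` are all made up for illustration; in practice the priors would have to be estimated from whatever context is available.

```python
# Hypothetical gym vocabulary and made-up context priors (assumptions).
VOCAB = ["water bottle", "bowling ball", "basketball", "barbell", "bench"]
PRIORS = {"basketball": 0.5, "barbell": 0.3, "bowling ball": 0.2}


def candidates(abbrev, vocabulary):
    """Return terms whose first and last letters match the abbreviation's."""
    if not abbrev:
        return []
    first, last = abbrev[0], abbrev[-1]
    matches = []
    for term in vocabulary:
        squeezed = term.replace(" ", "")  # compare against the term without spaces
        if squeezed[0] == first and squeezed[-1] == last:
            matches.append(term)
    return matches


def rank(abbrev, vocabulary, priors):
    """Order the candidates by prior probability, highest first."""
    found = candidates(abbrev, vocabulary)
    return sorted(found, key=lambda term: priors.get(term, 0.0), reverse=True)


print(rank("bsktball", VOCAB, PRIORS))
print(candidates("wtrbtl", VOCAB))
```

With this toy vocabulary, "bsktball" still leaves three candidates that only the priors can separate, and "wtrbtl" finds nothing at all because "water bottle" ends in "e" rather than "l". That is exactly the multi-word splitting problem described above.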
UPDATED:
I also had the idea to combine a couple of string metric algorithms: use something like a Match Rating Algorithm to determine a set of related terms, and then calculate the Levenshtein distance between each word in that set and the target abbreviation. However, I am still in the dark when it comes to abbreviations for words that are not in a master dictionary. Basically, I would need to infer how a word was constructed. Maybe a Naive Bayes model could help, but I am concerned that any loss of precision caused by the algorithms above will invalidate the model training process.
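As a rough sketch of the filter-then-rank idea in Python: the code below does not implement a Match Rating Algorithm. As a stand-in for that first step it uses a simple subsequence test (the abbreviation's letters must appear, in order, in the term with spaces removed), and then ranks the survivors by Levenshtein distance. The vocabulary and the names `is_subsequence` and `expand` are assumptions made up for the example.

```python
# Hypothetical gym vocabulary (assumption).
VOCAB = ["water bottle", "bowling ball", "basketball", "barbell", "bench"]


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def is_subsequence(abbrev, term):
    """True if abbrev's letters occur in term in the same order."""
    remaining = iter(term)
    return all(ch in remaining for ch in abbrev)


def expand(abbrev, vocabulary):
    """Filter by subsequence (stand-in pre-filter), then rank by edit distance."""
    scored = []
    for term in vocabulary:
        squeezed = term.replace(" ", "")
        if is_subsequence(abbrev, squeezed):
            scored.append((levenshtein(abbrev, squeezed), term))
    return [term for _, term in sorted(scored)]


for abbrev in ("wtrbtl", "bwlingbl", "bsktball"):
    print(abbrev, "->", expand(abbrev, VOCAB))
```

The pre-filter matters here because abbreviations are mostly deletions, so raw edit distance tends to favor short terms: with this vocabulary, "wtrbtl" is 4 edits from "barbell" but 5 from "waterbottle". The sketch also assumes every letter of the abbreviation survives from the full term, and it does nothing for terms missing from the vocabulary, which is the open problem.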
Any help is appreciated, as I am really stuck on this one.