
I have a list of shorthand text, all in English. Is there a machine learning algorithm that can be used to expand these abbreviations? For example, if the shorthand is 'txt', it could suggest 'text', 'context', 'textual', etc., with varying penalty values.

In addition, when I choose the right word, I want it to learn from that choice so that the next time I input the same shorthand, my choice gets a high rating.

Edit

Specifically, I have tried using the language model described here, but it only works for edits up to two levels. The 'edit' function is below:

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

It basically takes a word and generates every string that is one edit away by deleting, transposing, replacing and inserting letters (using the letters of the alphabet).

How do I extend this to more than two edits?

Tonechas
TheDataGuy

1 Answer

Your question has two parts: the first is about producing candidate words, and the second is about ranking those words (and updating the rankings). I'll address the two parts in turn and point out where machine learning fits, since that was part of the original question.

For the first part, I don't think you need machine learning; having thought about it a little, using ML here seems artificial. I think you could make good headway with a dictionary of acronyms combined with synonyms.

  1. For example, start by looking up "txt" in a list such as this, which lists "text" as an expansion for "txt".
  2. Take "text" and look up its synonyms. You may want to restrict the synonyms to those that look similar to the original acronym, i.e. those containing a substring with a small edit distance to "txt" or containing the expansion from the acronym dictionary ("text"). Take a look at this post for how to use NLTK for finding synsets.

The important part here is to cover all the acronyms you'll encounter, so you may want to allow the user to enter missing acronyms and their expansions.
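
As a rough illustration of the lookup-and-filter idea, here is a minimal sketch. The `ABBREVIATIONS` table, the function names and the subsequence test are all assumptions made for this example; a real dictionary would be loaded from an abbreviation list, and the related words would come from a synonym source such as NLTK's WordNet.

```python
# Hypothetical seed dictionary; in practice this would be loaded from a real list.
ABBREVIATIONS = {
    "txt": ["text"],
    "msg": ["message"],
}


def is_subsequence(short, word):
    """True if the letters of `short` appear in `word` in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in short)


def expand(short, related_words=()):
    """Return dictionary expansions, then related words that still resemble `short`."""
    candidates = list(ABBREVIATIONS.get(short, []))
    for word in related_words:
        # Assumed similarity filter: keep words that contain the shorthand's letters in order.
        if is_subsequence(short, word) and word not in candidates:
            candidates.append(word)
    return candidates


def add_expansion(short, word):
    """Record a user-supplied expansion for a shorthand the dictionary lacks."""
    expansions = ABBREVIATIONS.setdefault(short, [])
    if word not in expansions:
        expansions.append(word)
```

Calling `expand("txt", ["textual", "context", "book"])` keeps "textual" and "context" but drops "book". The subsequence test is only one possible similarity filter; an edit-distance threshold would work as well.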

For the second part, you may want to combine two scoring algorithms to assign a score to each word and rank the words by their scores.

The first scoring algorithm should be something that works without any user data, so that initially you have some semi-intelligent ordering of words. An example would be scoring a word by how many edits separate it from the acronym. So "textual" would get a lower score than "text" for the acronym "txt", because it takes more inserted letters to go from "txt" to "textual".
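
One way to sketch this first score is with the standard Levenshtein distance (insertions, deletions and substitutions, without the transpositions used in the question's function). The function names and the `1 / (1 + distance)` mapping are assumptions for illustration, not the only reasonable choice.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b` (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_score(short, word):
    """Assumed mapping: fewer edits gives a higher score in (0, 1]."""
    return 1.0 / (1.0 + edit_distance(short, word))
```

With this mapping, "text" is one edit from "txt" and "textual" is four edits away, so "text" ranks higher before any user data exists.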

The second scoring algorithm would take over as you get more user data. An example of something you could use would be to keep track of the popularity of each word (i.e. what fraction of times it was chosen). See Online machine learning.
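
A minimal sketch of the popularity idea follows, assuming choices are kept in memory; the `ChoiceTracker` class and its method names are hypothetical.

```python
from collections import Counter, defaultdict


class ChoiceTracker:
    """Tracks which expansion the user picked for each shorthand."""

    def __init__(self):
        # Maps shorthand -> Counter of chosen words.
        self._counts = defaultdict(Counter)

    def record(self, short, word):
        """Call this each time the user picks `word` for `short`."""
        self._counts[short][word] += 1

    def popularity(self, short, word):
        """Fraction of past choices for `short` that were `word` (0.0 if none yet)."""
        counts = self._counts.get(short)
        if not counts:
            return 0.0
        return counts[word] / sum(counts.values())
```

Because the counts are updated one choice at a time, the ranking adapts as the user keeps selecting words, which is the online-learning flavour mentioned above.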

Combine the two scores into a final score via a learned linear function (See Linear Regression).
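
To make the combination concrete, here is a hedged sketch of a linear scorer with a single stochastic-gradient update for squared error. The feature order (distance score, popularity), the learning rate and the function names are assumptions; in practice you might fit the weights with a library's linear regression instead.

```python
def combined_score(features, weights, bias=0.0):
    """Linear combination of per-word features, e.g. [distance_score, popularity]."""
    return bias + sum(w * x for w, x in zip(weights, features))


def sgd_update(features, target, weights, bias, lr=0.1):
    """One gradient step on squared error; target is 1.0 if the word was chosen, else 0.0."""
    error = combined_score(features, weights, bias) - target
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    new_bias = bias - lr * error
    return new_weights, new_bias


def rank(candidates, weights, bias=0.0):
    """Sort (word, features) pairs by combined score, best first."""
    return sorted(candidates,
                  key=lambda item: combined_score(item[1], weights, bias),
                  reverse=True)
```

Each time the user makes a choice, the chosen word can be fed to `sgd_update` with target 1.0 and the rejected candidates with target 0.0, so the weights gradually shift toward whichever feature predicts the user's choices better.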

Vishaal