How can I compare a word against a list of chosen words to find the word that correlates the strongest?

Question

I am looking to design a search box that will take any input and return the most appropriate output from a chosen list of outputs.

As an example, my chosen list of outputs is animal, vehicle and place.

If the user searches for cat, I would like the code to compare cat against animal, vehicle and place and compute a correlation/matching score for each. Animal should generate the highest score, so the output will be animal.

Similarly, typing in car will output vehicle from the list.

Any ideas on what is the best way to generate this correlation score? My output list consists of 100 different terms.

Unless you have a massive amount of contextual training data this would be extremely difficult (and I would be almost tempted to say impossible) for an arbitrary input. — nico, Mar 18 '15 at 12:17
Is the input unconstrained, how big a dictionary could it be? Could you just enumerate the most common? Otherwise, either train your own model (LDA? Bayesian?) like @nico said, or use an API per my answer, or use the API to train your model. — smci, Mar 18 '15 at 12:23

score 1 · Answer 1 · edited May 23 '17 at 11:56

You're looking for a classifier. Either look up an API dynamically, or use the API to train your own model (and maybe fall back to the API when your model doesn't produce a match).

For example, one way is to use the Wiktionary API, per the answer to "Is there any free online dictionary API (json/xml) with multiple languages to choose from?"

Here is the entry for cat (English, Etymology 1, Noun); you then just need to process the entry to spot keywords like animal/vehicle/place. It's doable.
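As a rough illustration of the keyword-spotting step, here is a minimal sketch. It assumes the dictionary entry has already been fetched and reduced to plain text; the keyword sets, the function names and the sample definition are all hypothetical placeholders, not anything returned by the Wiktionary API.

```python
import re
from collections import Counter

# Hypothetical keyword sets; real ones would need tuning for each output term.
CATEGORY_KEYWORDS = {
    "animal": {"animal", "mammal", "feline", "species", "bird", "fish"},
    "vehicle": {"vehicle", "automobile", "wheeled", "motor", "transport"},
    "place": {"place", "city", "country", "town", "region", "location"},
}

def score_definition(definition_text):
    """Count how many words in the definition hit each category's keywords."""
    words = re.findall(r"[a-z]+", definition_text.lower())
    counts = Counter(words)
    return {
        category: sum(counts[keyword] for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }

def best_category(definition_text):
    """Return the highest-scoring category, or None if no keyword matched."""
    scores = score_definition(definition_text)
    category, score = max(scores.items(), key=lambda item: item[1])
    return category if score > 0 else None

# Placeholder text standing in for whatever the dictionary lookup returns.
sample = "A domesticated animal kept as a pet; a carnivorous mammal of the feline family."
print(best_category(sample))
```

With 100 output terms the same structure applies, but each term needs its own keyword set, and ties or zero scores need a policy (here, zero hits simply returns None).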

Or just look for an online list of animals, vehicles, places.
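The list-based variant could look something like the sketch below. The word lists are tiny hypothetical stand-ins for lists you would download, and the optional `fallback` callable is an assumption standing in for a dictionary API lookup or a trained model.

```python
# Hypothetical word lists; in practice these would be loaded from downloaded files.
WORD_LISTS = {
    "animal": {"cat", "dog", "jaguar", "horse"},
    "vehicle": {"car", "truck", "bicycle", "jaguar"},
    "place": {"paris", "london", "beach"},
}

def build_index(word_lists):
    """Invert category -> words into word -> set of categories."""
    index = {}
    for category, words in word_lists.items():
        for word in words:
            index.setdefault(word.lower(), set()).add(category)
    return index

INDEX = build_index(WORD_LISTS)

def lookup(word, fallback=None):
    """Return the sorted categories for a word, deferring to fallback on a miss."""
    categories = INDEX.get(word.strip().lower())
    if categories:
        return sorted(categories)
    if fallback is not None:
        return fallback(word)
    return []

print(lookup("cat"))
print(lookup("Jaguar"))
```

Returning a list rather than a single term keeps words that appear in more than one list visible to the caller, which can then decide how to break the tie.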

There are many other APIs; most require registration, and some are paid.

answered Mar 18 '15 at 12:28 by smci · edited May 23 '17 at 11:56 by Community

This is a good approach but can easily fail for an arbitrary input (maybe not in the case of animals, but definitely for places). Also: [is Jaguar a car or a feline](http://www.zeroto60times.com/2013/03/complete-list-cars-animal-names/)? – nico Mar 18 '15 at 13:52
Note that I am not saying your answer is bad, it is actually probably the best solution to what is IMHO a poorly defined problem :) If instead of single words the program was accepting snippets of text then other approaches could be taken. – nico Mar 18 '15 at 13:54
@nico : I already mentioned the LDA or Bayesian approaches in my comment above, which would distinguish contexts, based on the other words. I seriously doubt the OP really wants to train an arbitrarily-complex classifier just for this task. – smci Mar 19 '15 at 03:27
