16

BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john','jay','dan','nathan','bob']  -> 'Names'
['yellow', 'red','green'] -> 'Colors'
['tokyo','bejing','washington','mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple", I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.

APPROACH

I did some research and came across Word2vec. This library has "similarity" and "most_similar" functions which I can use. So one brute-force approach I thought of is the following:

  1. Take the new input.
  2. Calculate its similarity with each word in each vector and take an average.

So, for instance, for the input "pink" I can calculate its similarity with the words in the "Names" vector and take an average, then do the same for the other two vectors. The vector that gives me the highest average similarity would be the correct category for the input.
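
To make the idea concrete, here is a rough sketch of what I have in mind, assuming gensim's pre-trained vectors (the model name below is just one convenient choice, and I used the spelling 'beijing' since misspellings may be missing from a pre-trained vocabulary):

import gensim.downloader as api

# Any pre-trained KeyedVectors model would do; this name is just one choice
model = api.load('glove-wiki-gigaword-100')

data = {
  'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
  'Colors': ['yellow', 'red', 'green'],
  'Places': ['tokyo', 'beijing', 'washington', 'mumbai'],
}

def predict(word):
  # Average similarity of `word` to every word in each category,
  # then pick the category with the highest average
  averages = {
    category: sum(model.similarity(word, w) for w in words) / len(words)
    for category, words in data.items()
  }
  return max(averages, key=averages.get)

print(predict('pink'))     # expected: Colors
print(predict('calgary'))  # expected: Places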

ISSUE

Given my limited knowledge of NLP and machine learning, I am not sure this is the best approach, so I am looking for help and suggestions on better ways to solve my problem. I am open to all suggestions, and please also point out any mistakes I may have made, as I am new to the machine learning and NLP world.

Dinero
  • Use spaCy's NER, and you can also train the spaCy model with your data. – Aaditya Ura Dec 06 '17 at 04:20
  • @AyodhyankitPaul I will google that right now! Thanks for the feedback; if possible, I would love a small demo. – Dinero Dec 06 '17 at 04:22

2 Answers

30

If you're looking for the simplest/fastest solution, I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

Here's my solution below:

import numpy as np

# Category -> words
data = {
  'Names': ['john','jay','dan','nathan','bob'],
  'Colors': ['yellow', 'red','green'],
  'Places': ['tokyo','bejing','washington','mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories.keys()}

# Processing the query: score each category by the average dot product
# between the query embedding and the embeddings of that category's words
def process(query):
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    # Dot-product similarity, averaged over the category's word count
    dist = query_embed.dot(embed)
    dist /= len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))

In order to run it, you'll have to download and unpack the pre-trained GloVe data from here (careful, 800Mb!). Upon running, it should produce something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}

... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf scores. Remember that the model size only depends on the data you have and the words you might want to be able to query.
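
For illustration, here is a rough sketch of the simpler whitelist route (not the tf-idf filtering itself): keep only the words you expect to query and write out a much smaller file. The whitelist and output filename below are placeholders.

# Keep only the words we expect to query (placeholder whitelist)
keep = set(categories) | {'pink', 'frank', 'moscow'}
small_index = {word: vec for word, vec in embeddings_index.items() if word in keep}

# Write a much smaller embedding file in the same format as glove.6B.100d.txt
with open('glove.small.txt', 'w') as f:
  for word, vec in small_index.items():
    f.write(word + ' ' + ' '.join(map(str, vec)) + '\n')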

Maxim
  • This is very interesting. So obviously the word embeddings have already been created. When I tried print(process('kobe')), it classified 'kobe' as a place even though 'kobe' is a name; however, when I added 'kobe' to the data dictionary under Names, it classified 'kobe' as a name. I am trying to understand what is happening under the hood. It gave the highest score to Names (9.38), but the score for the Places category was pretty close (9.08). – Dinero Dec 12 '17 at 01:33
  • Some terms are naturally on the border. Remember that embeddings are learned from texts. E.g., `paris` is frequently used both as a city and as a name (Paris Hilton). Same for Kobe: I know only one usage as a name, though a very popular one, but it's also a place in Japan - https://en.wikipedia.org/wiki/Kobe . This is a common problem in classification. For general understanding, see this answer - https://stackoverflow.com/a/46727571/712995 - and the further links it refers to. – Maxim Dec 12 '17 at 07:48
  • Also, I found that when I did print(process('a2')) I got negative scores for all 3 categories. Then I added a new category called "Id" with values like a1, a2, b1, b2. Then I did print(process('a2')) and print(process('c2')) and got a high score for the "Id" category in both cases. So is the code above learning the meaning of new values under the hood? Since I added a new category called "Id", it is somehow able to figure out that values like a1, b2, c3 are closely related. – Dinero Dec 12 '17 at 15:18
  • Also, would it make a difference if a value like "Kobe" occurs several times in the Names category? Does this code take frequency of occurrence into consideration? – Dinero Dec 12 '17 at 15:19
  • 1) Of course it would, but you'd have to change the Python dict to a list of tuples. It would be simpler to keep a separate index of coefficients per word, if you want to go this way. 2) Negative scores are absolutely possible, no problem here. 3) This solution uses an *already trained* model. If you want to train it yourself, it's totally possible, but bear in mind that the training data must be very large to make a difference - something comparable to the size of Wikipedia. – Maxim Dec 12 '17 at 15:31
  • I am confused because for a word like "paris" it makes sense that there is a word embedding. But terms like "a1" and "a2" are not English words, so there should be no embeddings for them. So how exactly is it able to classify them into one category? I see the results I want, but I want to understand how it is happening. – Dinero Dec 12 '17 at 15:36
    It knows a lot of words, because it was trained on an enormous text corpus. Which, apparently, has something about `a1`, `a2`, ... Describing GloVe in detail would need a lot of space, you can start here: https://nlp.stanford.edu/projects/glove/ – Maxim Dec 12 '17 at 15:41
  • I think it's a nice & elegant solution (+1); and terms on the border, such as 'kobe' (which I also knew as a place, not a name), can be addressed with additional post-processing rules (e.g. when the difference between the two highest scores is below a threshold, return both; see the sketch after these comments). – desertnaut Dec 12 '17 at 16:26
  • @Maxim This looks good; I tested it out. Just wondering: what if I had a category with bigrams or trigrams? Let's say I have a bunch of addresses ('10 hacker road', '123 washington street'), etc. Would it still be possible to use this approach? – Dinero Dec 14 '17 at 18:55
  • @Dinero If I understand your question right, it's still possible, but will require a bit more work. – Maxim Dec 14 '17 at 21:12
  • @Maxim Would love it if you could add that approach to your answer or describe it. – Dinero Dec 15 '17 at 14:18
  • Maybe a little late to ask, but I need to know the meaning of the scores. I'm using this in a project and I don't know how to explain it beyond "the higher the score, the better the match". – Nino Gutierrez Dec 26 '19 at 14:58
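
Here is a minimal sketch of the post-processing rule suggested by desertnaut above, reusing process() from the answer; the threshold value is an arbitrary placeholder, not something tuned here:

def predict_with_margin(query, threshold=1.0):  # threshold is an arbitrary placeholder
  scores = process(query)
  ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
  (best, best_score), (runner_up, runner_up_score) = ranked[0], ranked[1]
  # If the top two categories are too close, report both as candidates
  if best_score - runner_up_score < threshold:
    return [best, runner_up]
  return [best]

print(predict_with_margin('kobe'))  # border terms may return ['Names', 'Places']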
1

Also, for what it's worth, PyTorch has a good and fast implementation of GloVe these days.
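
For example, a minimal sketch assuming the torchtext GloVe wrapper is what's meant here (the exact API differs between torchtext versions):

import torch
from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=100)  # downloads and caches the vectors
query = glove['pink']              # out-of-vocabulary tokens map to a zero vector

data = {
  'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
  'Colors': ['yellow', 'red', 'green'],
  'Places': ['tokyo', 'beijing', 'washington', 'mumbai'],
}

# Same averaged dot-product scoring as the accepted answer, per category
scores = {
  cat: torch.stack([glove[w] for w in words]).mv(query).mean().item()
  for cat, words in data.items()
}
print(max(scores, key=scores.get))  # expected: Colors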

mithunpaul