
I have a NumPy array, matrixValue, and three lists containing the following:

matrixValue: (type: ndarray) number of occurrences of a word in wordList, in descending order

[.. 62 62 ..]
[.. 23 21 ..]
[.. 14 13 ..]

valueList: (type: list) number of occurrences of a word in wordList, in descending order

[... 74, 71, 63, 62, 62, 50, 40, 23, 21, 14, 13, 11, 11...]

userGivenWord: (type: list) user-specified words

[... water, animal, flower...]

wordList: (type: list) contains a list of English dictionary words

[.. water, ocean, lake, green, blue, sea...]

Given a user-defined word, I need to retrieve the words from wordList that have some "semantic similarity" to it. My problem is that for any repeated occurrence count (e.g. 62, 11) in valueList, only the first English word, 'lake', is printed and not 'sea' (assume that lake and sea occur 62 times each).

Output

water = lake, lake # wrong output
water = lake, sea # correct output

Here is the part of the code that I'm very sure is causing the problem:

for i in range(0, 3): # printing top 3 words
    value = matrixValue[i] # returns the first two numbers, 62 & 62
    iValue = valueList.index(value) # returns the indexes in valueList for the above value 
    tagword = str(tag_list[iValue]) # retrieves the word based on the iValue
    res = userTags[x] + " = " + tagword

Again, both lake and sea occur 62 times. I believe the error happens in the second line of the for-loop. When I stepped through it in the debugger, I noticed that the word 'lake' is added twice to the result list (and 'sea' is never added). I'm not sure whether I've explained this clearly, so please feel free to ask if anything needs clarification.

Kuan E.
  • Pseudocode like `[.. 62 62 ..]` and `[.. water, ocean, lake, green, blue, sea...] ` makes it harder for people to answer your question because they need to spend time guessing what the actual shape of your array is and then manually type out some data that hopefully has the same properties as yours. Try and make it so someone can copy, paste and run your code in the question. See [this question](http://stackoverflow.com/q/19268937/553404) for a data-as-code example, [these R guidelines](http://stackoverflow.com/a/5963610/553404) and this [SO help page](http://stackoverflow.com/help/mcve). – YXD Mar 07 '15 at 11:50
  • Thanks @MrE, I will read the guides you've kindly provided me, update my question and explain my data structure in better view. I might require some time though to recreate the structures in a similar context to this. – Kuan E. Mar 07 '15 at 12:15
  • It's just a suggestion - if you get the answer you want anyway then that's great :) – YXD Mar 07 '15 at 12:19

1 Answer


A Counter dict will be much simpler if you want to get the words that appear the same number of times:

from collections import Counter

c = Counter(["foo","foo","bar","bar","foobar","foob"])
print([k for k, v in c.items() if v == 2 ])
['foo', 'bar']

However you calculate the similarity, if it is not by frequency, simply store the words that share a value in your dict and access them by key. Apart from being much less efficient than a dict lookup, list.index is always going to return the first occurrence, which is why 'lake' shows up twice.
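
The first-occurrence behaviour is easy to reproduce. Here is a minimal sketch that uses made-up counts and words in place of the question's valueList and wordList (the data are assumptions, not the asker's real values):

```python
# Hypothetical stand-ins for the question's valueList and wordList.
value_list = [74, 62, 62, 23]
word_list = ["water", "lake", "sea", "green"]

# list.index always returns the position of the first match,
# so looking up 62 twice yields the same word twice.
first_only = [word_list[value_list.index(v)] for v in (62, 62)]
print(first_only)  # ['lake', 'lake']

# Collecting every matching position avoids the duplicate.
all_matches = [w for w, v in zip(word_list, value_list) if v == 62]
print(all_matches)  # ['lake', 'sea']
```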

To store the count as the key and group the words as values:

from collections import defaultdict

d = defaultdict(list)

for k,v in c.items():
    d[v].append(k)
print(d.get(2,"N/A"))
['foo', 'bar']
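
For the top-3 output in the question, another option is to skip the index lookup altogether and sort the word/count pairs together, so that tied counts keep their own words. This is only a sketch with invented counts; `top_words` is a hypothetical helper, not something from the question or from a library:

```python
def top_words(words, counts, n=3):
    """Return the n words with the highest counts."""
    # sorted() is stable, so words with equal counts keep their input order.
    pairs = sorted(zip(words, counts), key=lambda p: p[1], reverse=True)
    return [word for word, _ in pairs[:n]]

# Made-up data: 'lake' and 'sea' are tied at 62.
word_list = ["ocean", "lake", "green", "sea"]
counts = [74, 62, 23, 62]

print("water = " + ", ".join(top_words(word_list, counts)))
# water = ocean, lake, sea
```

Because each word travels with its own count, a repeated value can never map back to the same word twice.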
Padraic Cunningham