4

I use a dictionary to represent the word counts in an article.

For example, {"name": 2, "your": 10, "me": 20} represents that "name" appears twice, "your" appears 10 times and "me" appears 20 times.

So, is there a good way to calculate the Euclidean distance between these vectors? The difficulty is that the vectors have different lengths, and some vectors contain certain words while others do not.

I know I can write a long function to do this; I'm just looking for a simpler and cleverer way. Thanks.

Edit: The objective is to measure the similarity between two articles and group them.

Bear
  • Does this [previous answer of mine](http://stackoverflow.com/questions/14720324/compute-the-similarity-between-two-lists/14720386#14720386) provide any help to you? It uses `collections.Counter()`, which is the Python implementation of the bag data structure. – Martijn Pieters May 23 '13 at 12:02
  • You can only do that if both vectors are of the same length (i.e. map the same words), and are in the same order. – Blubber May 23 '13 at 12:03
  • You could compute the Euclidean distance on the intersection. Anyway, this is an arbitrary choice. If you told us your exact goal, we could probably help devise a good distance function for what you want to do. – Bakuriu May 23 '13 at 12:07
  • The question is how much sense it makes to calculate the Euclidean distance for data of different dimensionality. The vector `x = (x1, x2)` is two-dimensional and therefore comparable to a vector `y = (y1, y2)` in terms of Euclidean distance. But how would you, in this sense, compare `x` to a vector `z = (z1, z2, z3, z4, z5)`? – MaxPowers May 23 '13 at 12:24

2 Answers

9

Something like

math.sqrt(sum((a[k] - b[k])**2 for k in a.keys()))

Where `a` and `b` are dictionaries with the same keys. If you are going to compare these values between different pairs of vectors, then you should make sure that each vector contains exactly the same words; otherwise your distance measure is going to mean nothing at all.

You could calculate the distance based on the intersection alone:

math.sqrt(sum((a[k] - b[k])**2 for k in set(a) & set(b)))

Another option is to use the union and treat missing words as having a count of 0:

math.sqrt(sum((a.get(k, 0) - b.get(k, 0))**2 for k in set(a) | set(b)))

But you have to think carefully about what you are actually calculating.
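To make the difference between the two choices concrete, here is a minimal sketch that wraps both variants in one function. The function name `euclidean_distance`, its `keys` parameter, and the sample dictionaries are made up for illustration; they are not part of the question.

```python
import math


def euclidean_distance(a, b, keys="union"):
    """Euclidean distance between two word-count dicts.

    keys="union":        words missing from one dict count as 0.
    keys="intersection": only words present in both dicts are compared.
    """
    if keys == "union":
        words = set(a) | set(b)
    elif keys == "intersection":
        words = set(a) & set(b)
    else:
        raise ValueError("keys must be 'union' or 'intersection'")
    # .get(w, 0) is only needed for the union case, but is harmless otherwise.
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))


# Hypothetical sample data: the articles differ only in words they do not share.
doc1 = {"name": 2, "your": 10, "me": 20}
doc2 = {"name": 2, "your": 10, "them": 5}

print(euclidean_distance(doc1, doc2, keys="intersection"))  # 0.0
print(euclidean_distance(doc1, doc2, keys="union"))         # sqrt(20**2 + 5**2)
```

With the intersection, the two articles look identical because the words they do not share are ignored; with the union, those words are exactly what drives the distance. Which behaviour is appropriate depends on what "similar" should mean for your grouping.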

Blubber
0

You can also use the cosine similarity between the two vectors, as described at this link: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/cos.html
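As a rough sketch of how that might look for word-count dictionaries (this is not taken from the linked page; the function name `cosine_similarity` and the sample data are assumptions for illustration):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two word-count dicts.

    Words missing from a dict are treated as having a count of 0,
    so only words present in both dicts contribute to the dot product.
    """
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty article is defined as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample data: same word proportions, different article lengths.
short_doc = {"name": 1, "your": 5, "me": 10}
long_doc = {"name": 2, "your": 10, "me": 20}

print(cosine_similarity(short_doc, long_doc))      # approximately 1.0
print(cosine_similarity({"cat": 3}, {"dog": 4}))   # 0.0, no words in common
```

Because cosine similarity depends only on the angle between the vectors, an article and a longer article with the same word proportions score close to 1, whereas their Euclidean distance would be large. That can be a useful property when grouping articles of different lengths.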

G.Ahmed