Dig into code
The width, in your case, is the key parameter in get_features()
, which give different splitted words. The get_features()
in your case will output like:
['aa', 'aa', 'aa']
['aa', 'aa', 'as']
Then Simhash calculates these list as unweighted features (which means the default weight of each feature is 1) and output like:
86f24ba207a4912
86f24ba207a4912
They are the same!
The reason is from simhash algorithm itself. Let's look into the code:
def build_by_features(self, features):
"""
`features` might be a list of unweighted tokens (a weight of 1
will be assumed), a list of (token, weight) tuples or
a token -> weight dict.
"""
v = [0] * self.f
masks = [1 << i for i in range(self.f)]
if isinstance(features, dict):
features = features.items()
for f in features:
if isinstance(f, basestring):
h = self.hashfunc(f.encode('utf-8'))
w = 1
else:
assert isinstance(f, collections.Iterable)
h = self.hashfunc(f[0].encode('utf-8'))
w = f[1]
for i in range(self.f):
v[i] += w if h & masks[i] else -w
ans = 0
for i in range(self.f):
if v[i] >= 0:
ans |= masks[i]
self.value = ans
from: leonsim/simhash
The calculation process can be divied into 4 steps:
1) hash each splitted word (feature), to transform string into binary numbers;
2) weight them;
3) assumble weighted bits together;
4) change the assumbled number into binary and output as the value.
Now, in your case, the step 3 will output like:
[-3, 3, -3, -3, 3, -3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, 3, 3, 3, 3, -3, -3, -3, -3, -3, -3, 3, -3, -3, -3, 3, -3, 3, 3, 3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, 3, 3, 3, -3, 3, 3, -3, -3, -3, -3, 3, -3, -3, -3, -3]
[-1, 3, -3, -1, 3, -3, -3, -1, 3, -3, -3, 1, -1, -1, 1, -3, -3, 3, -1, 3, 1, 3, 1, -3, -1, -3, -3, -1, -1, 3, -1, -1, -1, 3, -1, 1, 3, 1, -1, 1, -3, -3, 1, -1, -3, 3, -3, -1, 1, 3, 3, 3, -3, 3, 3, -3, -1, -1, -1, 1, -3, -3, -3, -1]
And after step 4, the 2 output the same value.
Other parameter
If you change the width from 2 to 1, 3, 4, you will get different result of
Simhash(get_features())
.
Your case shows the limitation of simhash with short length text.