The following is the code I've written to implement Flajolet and Martin's algorithm. I've used the Jenkins hash function to generate a 32-bit hash value of the data. The program seems to follow the algorithm but is off the mark by about 20%: my data set consists of more than 200,000 unique records, whereas the program reports about 160,000. Please help me understand the mistake(s) I'm making. The hash function is implemented as per Bob Jenkins' website.
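To put the discrepancy in numbers, here is a small sketch of the arithmetic. The helper names `relative_error` and `pcsa_standard_error` are hypothetical, the record counts are the rounded figures quoted above, and the 0.78/sqrt(nmap) figure is the standard error commonly quoted for PCSA, so treat it as a rough yardstick rather than an exact bound.

```python
import math


def relative_error(estimate, truth):
    """Signed relative error of an estimate against the true count."""
    return (estimate - truth) / truth


def pcsa_standard_error(nmap):
    """Approximate standard error commonly quoted for PCSA with nmap bitmaps."""
    return 0.78 / math.sqrt(nmap)


if __name__ == "__main__":
    # Rounded figures from the description above.
    print(relative_error(160_000, 200_000))
    # nmap = 64 is only an illustrative value, not the one used in my run.
    print(pcsa_standard_error(64))
```

If the observed error is several times larger than the expected standard error for the chosen `nmap`, that would point to a systematic bias rather than ordinary estimator noise.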

    import numpy as np
    from jenkinshash import jhash

    class PCSA():
        def __init__(self, nmap, maxlength):
            self.nmap = nmap
            self.maxlength = maxlength
            self.bitmap = np.zeros((nmap, maxlength), dtype=np.int)

        def count(self, data):
            hashedValue = jhash(data)
            indexAlpha = hashedValue % self.nmap
            ix = hashedValue / self.nmap
            ix = bin(ix)[2:][::-1]
            indexBeta = ix.find("1")  # find index of lsb
            if self.bitmap[indexAlpha, indexBeta] == 0:
                self.bitmap[indexAlpha, indexBeta] = 1

        def getCardinality(self):
            sumIx = 0
            for row in range(self.nmap):
                sumIx += np.where(self.bitmap[row, :] == 0)[0][0]
            A = sumIx / self.nmap
            cardinality = self.nmap * (2 ** A) / MAGIC_CONST
            return cardinality
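
For comparison, here is a minimal, self-contained Python 3 sketch of PCSA as I understand it. It is not the code I ran: `hashlib.sha1` truncated to 32 bits stands in for the Jenkins hash, the names `PCSASketch`, `hash32` and `PHI` are hypothetical, and `PHI = 0.77351` is the correction constant from the Flajolet-Martin paper, which I assume plays the role of `MAGIC_CONST`. Two details are choices of this sketch and may or may not match the code above: the mean is computed with true (floating-point) division, and a zero remainder is mapped to the last column instead of relying on `find` returning -1.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin paper


def hash32(item):
    """Stand-in 32-bit hash (assumption: SHA-1 truncated, not Jenkins)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class PCSASketch:
    def __init__(self, nmap, maxlength=32):
        self.nmap = nmap
        self.maxlength = maxlength
        self.bitmap = [[0] * maxlength for _ in range(nmap)]

    def add(self, item):
        h = hash32(item)
        row = h % self.nmap    # which bitmap receives this item
        rest = h // self.nmap  # integer division keeps the value an int
        if rest == 0:
            col = self.maxlength - 1  # no set bit: use the last column
        else:
            # index of the least significant set bit
            col = (rest & -rest).bit_length() - 1
            col = min(col, self.maxlength - 1)
        self.bitmap[row][col] = 1

    def cardinality(self):
        total = 0
        for row in self.bitmap:
            # position of the first zero, or maxlength if the row is full
            total += row.index(0) if 0 in row else self.maxlength
        mean_r = total / self.nmap  # true division: keeps the fraction
        return self.nmap * 2 ** mean_r / PHI


if __name__ == "__main__":
    sketch = PCSASketch(nmap=64)
    for i in range(20000):
        sketch.add("record-%d" % i)
    print(sketch.cardinality())
```

Adding the same records a second time should leave the estimate unchanged, since each record always sets the same bit.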